Developing Efficient Graphics Software
Intent of Course
• Identify application and hardware interaction
• Quantify and optimize interaction
• Identify efficient software structure
• Balance software and hardware system component use
Outline
• 1:35 Hardware and graphics architecture and performance
• 2:05 Software and System Performance
• Break
• 2:55 Software profiling and performance analysis
• 3:20 C/C++ language issues
• 3:50 Graphics techniques and algorithms
• 4:40 Performance Hints
Speakers
• Applications Consulting Engineers for SGI
– optimizing, differentiating, graphics
• Keith Cok, Bob Kuehne, Thomas True, Alan Commike
Hardware & Graphics Architecture & Performance
Bob Kuehne, SGI
Course Overview
Why is your application drawing so slowly?
• Could actually be the graphics
• Could be the data traversal
• Could be something entirely different
Tour Guide
Platform architecture & components
• CPU
• Memory
• Graphics
Graphics performance
• Measurements: triangle rate, fill rate, misc.
• Reproduce & maximize
Bottlenecks & Balance
Bottlenecks
• Find them
• Eliminate them (sort of - move them around)
Balance
• Understand hardware architecture
• Fully utilize hardware
Yin & Yang
• “Yin and yang are the two primal cosmic principles of the universe”
• “The best state for everything in the universe is a state of harmony represented by a balance of yin and yang.”
– Skeptic's Dictionary, http://skepdic.com/yinyang.html
Write Once Run Everywhere?
My application ran fast on that platform! Why is this one so slow?
• Different platforms require different tuning
• Different platforms implement hardware differently
– Macro: Architecture & features
– Micro: Storage capacities, buffers, & caches
– Effect: Bandwidth & latency
Latency & Bandwidth
Definitions:
• Latency: time required to communicate a unit of data
• Bandwidth: data transferred per unit time
Example:
• Latency bottleneck: s t s t s t s t s t s t
• Bandwidth bottleneck: s t t t t t t
(each box is one unit of time; s: texture setup time; t: texture download time)
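The two definitions combine into a simple cost model: the time to move a block of data is a fixed latency plus the block size divided by the bandwidth. The sketch below is only an illustration of that arithmetic; the function names `transfer_time` and `batched_time` are hypothetical and the numbers in the note that follows are made up, not measurements from the course.

```c
#include <stddef.h>

/* Time in seconds to move `bytes` of data over a link that has a fixed
 * per-transfer latency and a sustained bandwidth in bytes per second.
 * Hypothetical helper, for illustration only. */
double transfer_time(double latency_s, double bandwidth_bps, size_t bytes)
{
    return latency_s + (double)bytes / bandwidth_bps;
}

/* Total time to send `count` blocks of `bytes` each when the latency is
 * paid once per block. Many small transfers end up latency bound. */
double batched_time(double latency_s, double bandwidth_bps,
                    size_t bytes, size_t count)
{
    return (double)count * transfer_time(latency_s, bandwidth_bps, bytes);
}
```

With an assumed 1 ms setup latency and 100 MB/s of bandwidth, 64 separate 1 KB downloads cost roughly 65 ms, almost all of it setup, while a single 64 KB download costs roughly 1.7 ms.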
Platform: Software View
[Figure: software view of the platform, showing CPU, memory, graphics, I/O, net, and misc]
Platform: PCI, AGP
[Figure: two block diagrams. PCI: CPU, memory, and glue logic, with disk, net, graphics, and I/O on PCI. AGP: the same components, with an AGP port alongside PCI.]
Platform: UMA, Switched Hub
[Figure: two block diagrams (UMA and switched hub), each showing CPU, memory, graphics, disk, net, and I/O]
Platform: The Points
Why learn about hardware?
• To understand how your app interacts with it
• To best utilize the hardware
• Potentially can use extra hardware features
Where?
• Platform documentation
• Talk with hardware vendor
CPU: Overview
CPU Operation
• Data transferred from main memory to registers
• CPU works on data in registers
Latency
• Registers: 0 (free)
• Level-1 (L1) cache: 1
• Level-2 (L2) cache: 10x L1
• Main memory: 100x L1
[Figure: CPU, registers (R), L1, L2, main memory]
CPU, Cache, and Memory
Caches designed to exploit data locality
• Temporal locality
• Spatial locality
[Figure: hierarchy of CPU, registers, L1, L2, main memory]
Memory: Cache & Logical Flow
[Flowchart: In register? If so, compute. Otherwise, in L1? If so, copy to register (cost 1). Otherwise, in L2? If so, copy to L1 (cost 10). Otherwise, copy to L2 (cost 100).]
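The relative costs on these slides (L1 = 1, L2 = 10, main memory = 100) make it easy to estimate what a cache miss rate does to the average access cost. The sketch below is a simplified model, assuming each access pays only the cost of the level that finally serves it; the function name and the hit rates in the note are illustrative assumptions.

```c
/* Relative access costs taken from the slides. */
#define COST_L1   1.0
#define COST_L2   10.0
#define COST_MAIN 100.0

/* Expected relative cost of one memory access.
 * l1_hit: fraction of accesses served by L1 (0..1).
 * l2_hit: fraction of L1 misses served by L2 (0..1).
 * Everything that misses both caches is served by main memory. */
double expected_access_cost(double l1_hit, double l2_hit)
{
    double l1_miss = 1.0 - l1_hit;
    return l1_hit * COST_L1
         + l1_miss * l2_hit * COST_L2
         + l1_miss * (1.0 - l2_hit) * COST_MAIN;
}
```

Under this model, a 95% L1 hit rate with 80% of the misses caught by L2 gives an average cost of about 2.35, more than double the all-in-L1 case, even though only 1% of accesses reach main memory.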
Memory: Cache & Physical Flow
[Figure: a page moving from main memory through L2 cache and L1 cache to the CPU registers]
Memory: Allocation & Pools
• List elements are often allocated as-needed
– This leads to spatial disparity
• Mitigated by use of application memory management
– Bad: malloc, malloc, malloc, malloc, ...
– Good: pools - pool_init, pool_alloc, ...
• Graphics example:
– Vertices, normals, textures, etc.
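A pool replaces many small `malloc` calls with one large allocation, so consecutive elements sit next to each other in memory. The slide names `pool_init` and `pool_alloc` but does not show them; the version below is a minimal sketch of one possible implementation, and `pool_t` and `pool_destroy` are hypothetical additions.

```c
#include <stddef.h>
#include <stdlib.h>

/* Fixed-size-element pool: one malloc up front, then elements are
 * handed out from contiguous storage. Illustrative sketch only. */
typedef struct {
    unsigned char *base;  /* start of the contiguous block */
    size_t elem_size;     /* size of one element, in bytes */
    size_t capacity;      /* number of elements the pool can hold */
    size_t used;          /* number of elements handed out so far */
} pool_t;

/* Returns 0 on success, -1 on bad arguments or allocation failure.
 * Pass sizeof(T) as elem_size so every element is aligned for T. */
int pool_init(pool_t *p, size_t elem_size, size_t capacity)
{
    if (elem_size == 0 || capacity == 0 || capacity > (size_t)-1 / elem_size)
        return -1;
    p->base = malloc(elem_size * capacity);
    if (p->base == NULL)
        return -1;
    p->elem_size = elem_size;
    p->capacity = capacity;
    p->used = 0;
    return 0;
}

/* Returns the next free element, or NULL when the pool is exhausted. */
void *pool_alloc(pool_t *p)
{
    if (p->used == p->capacity)
        return NULL;
    return p->base + p->elem_size * p->used++;
}

/* Releases the whole pool at once. */
void pool_destroy(pool_t *p)
{
    free(p->base);
    p->base = NULL;
    p->capacity = 0;
    p->used = 0;
}
```

Allocating vertices or normals from such a pool keeps a traversal walking through adjacent addresses, which is the spatial locality the cache slides ask for.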
Memory: Graphics! Vertex Arrays
[Figure: "Vertex Array Cache Behavior", plotting time to traverse against number of array vertices for four series: Platform 0 interleaved, Platform 0 non-interleaved, Platform 1 interleaved, Platform 1 non-interleaved]
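The two layouts compared in the plot differ in how far apart the attributes of one vertex sit in memory. The sketch below shows both layouts for positions and normals; the type names and `MAX_VERTS` are assumptions for illustration, and which layout traverses faster depends on the platform, as the four series in the plot suggest.

```c
#include <stddef.h>

#define MAX_VERTS 1024  /* illustrative array size */

/* Non-interleaved: one array per attribute. */
typedef struct {
    float position[MAX_VERTS][3];
    float normal[MAX_VERTS][3];
} separate_arrays_t;

/* Interleaved: all attributes of one vertex are adjacent. */
typedef struct {
    float position[3];
    float normal[3];
} vertex_t;

/* Byte distance between the position and the normal of vertex 0
 * in the non-interleaved layout. */
size_t separate_gap(void)
{
    return offsetof(separate_arrays_t, normal);
}

/* The same distance in the interleaved layout. */
size_t interleaved_gap(void)
{
    return offsetof(vertex_t, normal);
}
```

In the interleaved layout the normal follows the position within 12 bytes, so both usually arrive in the same cache line; in the non-interleaved layout they are a whole position array apart.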
Graphics: Pipe
[Figure: FIFO feeding the stages xf, light, clip, rast, fx, fops]
• xf: world to screen
• light: apply light
• clip: clip to view
• rast: convert to pixels
• fx: apply texture, etc.
• fops: test pixel ops
Graphics: Pipe & Akeley Taxonomy
• G - Generate geometric data
• T - Traverse data structures
• X - Transform primitives world to screen
• R - Rasterize triangles to pixels
• D - Display framebuffer on output device
[Figure: pipeline stages G, T, X, R, D]
Graphics: Hardware
4 types of hardware are common
• G-TXRD : all hardware
• GT-XRD :
• GTX-RD :
• GTXR-D : all software
Graphics: Performance
Benchmarks
• “Trust, but verify.” - an ex-president
Definitions
• Triangle rate: speed at which primitives are transformed (X)
• Fill rate: speed at which primitives are rasterized (R)
– Depth complexity: number of times pixel filled
Caveats
• Quantization, fastpath
Graphics: Quantization
• Frame quantization is the result of swapbuffers occurring at the next vertical retrace.
– Necessary to avoid image artifacts such as tearing
• Example: 100Hz display refresh
[Figure: timeline with retrace ticks t0 through t7, each 1/100 second apart; graphics frames of increasing length display at 100 Hz, 50 Hz, 50 Hz, and 33 Hz; a no-sync case runs at 120 Hz]
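Frame quantization can be worked out with a little integer arithmetic: a frame that waits for the next vertical retrace occupies a whole number of refresh intervals, so the displayed rate is the refresh rate divided by that number. The sketch below assumes a fixed render time per frame; the function names are hypothetical.

```c
/* Number of whole refresh intervals one frame occupies when the buffer
 * swap waits for the next vertical retrace.
 * render_us: time to draw one frame, in microseconds.
 * refresh_hz: display refresh rate. */
unsigned vsync_intervals(unsigned long render_us, unsigned refresh_hz)
{
    /* ceil(render_us * refresh_hz / 1e6) in integer arithmetic */
    unsigned long long ticks = (unsigned long long)render_us * refresh_hz;
    unsigned long long n = (ticks + 999999ULL) / 1000000ULL;
    return n == 0 ? 1u : (unsigned)n;
}

/* Frame rate actually seen on screen. */
double displayed_rate_hz(unsigned long render_us, unsigned refresh_hz)
{
    return (double)refresh_hz / vsync_intervals(render_us, refresh_hz);
}
```

On a 100 Hz display, a frame that takes 8 ms is shown at 100 Hz, but frames that take 11 ms or 19 ms both drop to 50 Hz, and a 21 ms frame drops to about 33 Hz, matching the steps in the figure.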
Graphics: Fastpath
Definition
• Fastpath: the most optimized path through graphics hardware
Example
• fast path: float verts, float norms, ABGR textures, z-test
• less fast path: float verts, float norms, RGBA textures, z-test
Graphics: Fastpath Example
• Fast path is often synonymous with ideal path.
– Real usage of graphics falls on a continuum.
• Must quantify what hardware can do
– Quality & speed
Graphics: Fastpath Points
[Figure: continuum from fast path (hardware, speed) to slow path (software, quality). Where is your application?]
Graphics Hardware: Testing
Duplicate performance numbers simply:
• Good: build a simple test program
• Better: glPerf - http://www.spec.org
Maximize performance in an app:
• Good: Use fast API extensions
• Better: Create an “is-fast” test, use what is verified as fast
Graphics Hardware: “Is-Fast”
Test each platform to determine fast path
• Once, per-machine, test primitives and modes
– Vertex array format, texture format, display list, etc.
• Store data in database
– Detect hardware changes or time-to-live
• Read data from database at startup
– Check database or re-generate data
Graphics Hardware: “Is-Fast”
Pseudo-code
    if ( new_machine() || hardware_changed() ) {
        test_interesting_modes();
        store_in_database();
    }
    else {  // have database entry
        get_performance_data_from_database();
    }
    // use the modes & primitives that are "fast" when rendering
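The "test interesting modes" step can be sketched as a small timing harness: run each candidate mode for a fixed number of iterations, record the time, and keep the fastest. Everything below is an assumption for illustration: `mode_fn`, `time_mode`, and `fastest_mode` are hypothetical names, and the callbacks stand in for whatever drawing code an application would test.

```c
#include <stddef.h>
#include <time.h>

/* A candidate rendering mode: a hypothetical callback that draws a
 * fixed workload once using one vertex or texture format. */
typedef void (*mode_fn)(void);

/* CPU seconds spent running `draw` `iterations` times. */
double time_mode(mode_fn draw, int iterations)
{
    clock_t start = clock();
    for (int i = 0; i < iterations; i++)
        draw();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Index of the smallest timing; n must be at least 1. */
size_t fastest_mode(const double *seconds, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (seconds[i] < seconds[best])
            best = i;
    return best;
}
```

Note that `clock()` measures CPU time only. Because graphics hardware runs asynchronously, a real test would likely need to wait for rendering to complete (for example with `glFinish`) and use a wall-clock timer before the numbers are stored in the database.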
Think Globally, Act Locally
Think globally
• Know the platforms & graphics hardware
• Use hardware effectively in your app
• Balance hardware utilization
Act locally
• Use in-cache data
• Understand hardware & graphics fastpaths
• Balance quality vs. performance
Software and System Performance
Thomas J. True, SGI
A Four Step Process
Quantify
System Evaluation
Graphics Analysis
Bottleneck Elimination
Quantify
Characterize
• Application Space
• Primitive Types
• Primitive Counts
• Rendering Characteristics
• Frame Rate
Quantify
Compare
[Figure: chart comparing my performance with ideal performance for triangle rate and fill rate]
Examine System Configuration
Resources
• Memory
• Disk
Setup
• Display
• Network
Graphics Analysis
Ideal Performance
• Keep graphics pipeline full.
• 100% CPU utilization running application code.
• 100% graphics utilization.
Graphics Analysis
Graphics Bound
[Figure: two gauges scaled 0 to 100, labeled "Acme Electronics"]
• Graphics subsystem processes data slower than CPU can feed it.
• Graphics subsystem issues an interrupt which causes the CPU to stall.
• Data processing within application stops until graphics subsystem can again accept data.
Graphics Analysis
Geometry Limited
• Limited by the rate at which vertices can be transformed and clipped.
Fill Limited
• Limited by the rate at which transformed vertices can be rasterized.
Graphics Analysis
CPU Bound
[Figure: two gauges scaled 0 to 100, labeled "Acme Electronics"]
• CPU at 100% utilization but can’t feed graphics fast enough.
• Graphics subsystem at less than 100% utilization.
• All CPU cycles consumed by data processing.
Graphics Analysis
Determination Techniques
• Remove graphics API calls.
• Shrink graphics window.
• Reduce geometry processing requirements.
• Use system monitoring tool.
Graphics Analysis
[Flowchart: from Start, each test (remove graphics API calls, remove rendering calls, shrink graphics window, reduce geometry load, use system monitoring tool) branches on whether the frame rate increases or does not change. Outcomes: graphics performance problem; performance problem not graphics; graphics bound, fill limited; graphics bound, geometry limited; fallen off fast path; excessive or unexpected CPU activity.]
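The determination techniques can be turned into a small decision function: measure the frame rate once normally and once after each experiment, then see which experiment made it rise. The ordering of the checks below is an assumption, since the exact branches of the flowchart are not recoverable from the text; the enum, function names, and tolerance parameter are all hypothetical.

```c
typedef enum {
    NOT_GRAPHICS,      /* removing graphics calls did not help */
    FILL_LIMITED,      /* shrinking the window helped */
    GEOMETRY_LIMITED,  /* reducing the geometry load helped */
    GRAPHICS_OTHER     /* graphics bound, cause not isolated
                          (for example, fallen off the fast path) */
} bottleneck_t;

/* True when test_fps beats base_fps by more than the given fraction. */
static int improved(double base_fps, double test_fps, double tolerance)
{
    return test_fps > base_fps * (1.0 + tolerance);
}

/* Classify a bottleneck from four frame-rate measurements:
 * the baseline, and one after each experiment named in the slides. */
bottleneck_t classify(double base_fps, double no_gfx_calls_fps,
                      double small_window_fps, double less_geometry_fps,
                      double tolerance)
{
    if (!improved(base_fps, no_gfx_calls_fps, tolerance))
        return NOT_GRAPHICS;
    if (improved(base_fps, small_window_fps, tolerance))
        return FILL_LIMITED;
    if (improved(base_fps, less_geometry_fps, tolerance))
        return GEOMETRY_LIMITED;
    return GRAPHICS_OTHER;
}
```

A tolerance of a few percent keeps measurement noise from being read as a frame-rate increase.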
Graphics Analysis
Graphics Architecture: GTXR-D (aka Dumb Frame Buffer)
• CPU does everything.
• Typically CPU bound.
• To remedy, buy a “real” graphics board.
Graphics Analysis
Graphics Architecture: GTX-RD
• Screen space operations performed by graphics.
• Object-space to screen-space transform on host.
• Can easily become CPU bound.
“Roughly 100 single-precision floating point operations are required to transform, light, clip test, project and map an object-space vertex to screen-space.” - K. Akeley & T. Jermoluk
• Beware of fast-path and slow-path issues.
• If Graphics Bound:
– Reduce per-pixel operations.
– Reduce depth complexity.
– Use native-format data.
• If CPU Bound:
– Reduce scene complexity.
– Use more efficient graphics algorithms.
Graphics Analysis
Graphics Architecture: GT-XRD
• Transformation and rasterization performed by graphics.
• Can be CPU or graphics bound.
• Beware of fast-path and slow-path issues.
• Subject to host bandwidth limitations.
• If Graphics Bound:
– Move lighting back to CPU.
– Use native data formats within application.
– Use display lists or vertex arrays.
– Use less expensive lighting modes.
• If CPU Bound:
– Move lighting from CPU to graphics subsystem.
– Do matrix operations in graphics hardware.
– Profile in search of computational performance issues.
Bottleneck Elimination
Bottlenecks
• Understanding them is crucial to effective tuning.
• Will always exist; tune to balance.
• Not always a bad thing.
Bottleneck Elimination
Graphics
• Use native graphics formats.
• Remove excessive state changes.
• Package graphics primitives efficiently.
• Use textures that fit in texture cache.
• Don’t use unnecessary rendering modes.
• Decrease depth complexity.
• Cull out excessive geometry.
Bottleneck Elimination
Memory
• Don’t allocate memory in rendering loop.
• Avoid copying and repackaging of graphics data.
• Organize graphics data.
• Avoid memory fragmentation.
Bottleneck Elimination
Memory Bandwidth and Fragmentation
• Independent Triangles: 9 vertices, 504 bytes
• Triangle Strip: 5 vertices, 280 bytes
• Vertex Array: 5 vertices, 280 bytes
Vertex = RGBA + XYZW + XYZ + STR = 56 bytes
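The byte counts above follow directly from the vertex layout: 14 floats per vertex, three vertices per independent triangle, and two shared vertices between neighbours in a strip. The sketch below reproduces the arithmetic; the struct and function names are hypothetical, and it assumes a 4-byte `float`.

```c
#include <stddef.h>

/* One vertex as described on the slide: 14 floats. */
typedef struct {
    float rgba[4];    /* color */
    float xyzw[4];    /* position */
    float normal[3];  /* XYZ normal */
    float str[3];     /* texture coordinates */
} vertex_t;

/* Bytes sent when every triangle carries its own three vertices. */
size_t independent_triangle_bytes(size_t triangles)
{
    return triangles * 3 * sizeof(vertex_t);
}

/* Bytes sent for one strip: n triangles need n + 2 vertices. */
size_t triangle_strip_bytes(size_t triangles)
{
    return triangles == 0 ? 0 : (triangles + 2) * sizeof(vertex_t);
}
```

For the three triangles in the slide, this gives 504 bytes as independent triangles and 280 bytes as a strip.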
Bottleneck Elimination
Code and Language
• Use native data types.
• Avoid contention for a single shared resource.
• Avoid application bottlenecks in non-graphics code.
• Reduce API call overhead.
Bottleneck Elimination
API Call Overhead
• Independent Triangles: (XYZW + RGBA + XYZ + STR) * 9 vertices: 36 function calls
• Triangle Strips: (XYZW + RGBA + XYZ + STR) * 5 vertices: 20 function calls
• Vertex Array: 5 function calls
• Display List: 1 function call
Conclusion
Performance Tuning an Iterative Process
Quantify
System Evaluation
Graphics Analysis
Bottleneck Elimination
Conclusion
It’s all about balance!
Profiling and Performance Analysis
Keith Cok, SGI
Profile and Performance Analysis
• Profiling points out code areas that take up most time
• Imperative for well balanced application
• Points out code and system bottlenecks
Two Methods of Software Profiling
Basic block
• A section of code that has one entry and one exit
• Measures ideal time
Statistical sampling
• Interrupts program execution and examines current location
• Measures actual CPU cycles spent executing a line of code
How Do You Profile Code?
• Compile/link with compiler optimizations turned on
– cc foo.c -use_all_optimization_flags ....
• Instrument the code
– Unix: pixie foo.exe -> foo.exe.pixie
– Visual Studio: embedded in tool suite
• Run the application with relevant data sets
– foo.exe.pixie - args -> produces results data file
Profiling: Finding the Hot Spot
Function list, in descending order by exclusive ideal time
excl.% cum.% instructions calls function (dso: file, line)
[1] 10.3% 10.3% 190583064 11484 GL_CreateSurfaceLightmap (foo: gl_rsurf.c, 1293)
[2] 8.9% 19.2% 173920781 3203 S_Update_ (foo: snd_dma.c, 848)
[3] 8.2% 27.4% 145950460 338787 R_RenderBrushPoly (foo: gl_rsurf.c, 641)
[4] 5.9% 33.3% 97798122 1975976 __sin (libm.so: sin.c, 194)
[5] 4.1% 37.4% 82310479 240 GL_LoadTexture (foo: gl_draw.c, 990)
[6] 3.4% 40.8% 50786176 1204269 __glMgrim_Begin (libGLcore.so: mgras_prim.c, 221)
[7] 3.2% 44.0% 58099072 16797 R_DrawAliasModel (foo: gl_rmain.c, 232)
[8] 3.1% 47.1% 53832546 290970 R_RecursiveWorldNode (foo: gl_rsurf.c, 894)
[9] 3.1% 50.2% 43855299 437627 R_CullBox (foo: gl_rlight.c, 313; compiled in gl_rmain.c)
[10] 2.8% 53.0% 44666700 30981 EmitWaterPolys (foo: gl_warp.c, 187)
Profiling: Fixing the Hot Spot
What do you look for?
• Common sub-expressions
• Loop invariant code
• Repeated pointer de-referencing
• Global variables and cache misses
• “Thin” loops
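Several of these patterns often show up together in one loop. The sketch below shows a loop that re-evaluates a pointer chain and an invariant product on every iteration, next to a version that hoists both into locals. The types and function names are hypothetical, chosen only to illustrate the pattern.

```c
#include <stddef.h>

typedef struct { float scale; float bias; } params_t;
typedef struct { const params_t *params; } context_t;

/* Before: ctx->params is de-referenced twice and the loop-invariant
 * product (scale * 2) is recomputed on every iteration. */
void apply_slow(const context_t *ctx, float *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] = v[i] * (ctx->params->scale * 2.0f) + ctx->params->bias;
}

/* After: the pointer chain and the invariant product are evaluated
 * once, outside the loop, and kept in local variables. */
void apply_fast(const context_t *ctx, float *v, size_t n)
{
    const params_t *p = ctx->params;
    const float gain = p->scale * 2.0f;
    const float bias = p->bias;
    for (size_t i = 0; i < n; i++)
        v[i] = v[i] * gain + bias;
}
```

An optimizing compiler may do this hoisting itself, but it cannot when it must assume that writing `v[i]` could change `ctx->params`, which is why doing it by hand can still pay off.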
Profiling Example
    // Code the old way
    19: void old_loop() {
    20:   sum = 0;
    21:   for (i = 0; i < NUM; i++)
    22:     sum += x[i];
    23:   printf("sum = %f\n", sum);
    24: }
    // Code the new way
    27: void new_loop() {
    28:   sum = 0;
    29:   ii = NUM%4;
    30:   for (i = 0; i < ii; i++)
    31:     sum += x[i];
    32:   for (i = ii; i < NUM; i += 4) {
    33:     sum += x[i];
    34:     sum += x[i+1];
    35:     sum += x[i+2];
    36:     sum += x[i+3];
    37:   }
    38:   printf(" sum = %f\n", sum);
    39: }
Profiling Example: Profile Results
         cycles  instructions  calls  function (dso: file, line)
    [1]    6160          6168      1  old_loop (blahdso.so: blahdso.c, 19)
    [2]    4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)

    [1]    4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)
    [2]    4625          4891      1  new_loop (blahdso.so: blahdso.c, 27)
Profile Example: Line Analysis
Line list, in descending order by time
    cycles  invocations  function (dso: file, line)
      4096         1024  old_loop  sum += x[i];
      2061         1024  old_loop  for (i = 0;i < NUM; i++)
       978          256  new_loop  sum += x[i+3];
       968          256  new_loop  sum += x[i+2];
       968          256  new_loop  sum += x[i+1];
       968          256  new_loop  sum += x[i];
       733          256  new_loop  for (i = ii; i < NUM; i +=4)
         7            1  new_loop  ii = NUM%4;
Profile and Performance Analysis
Profile Example: Visual C++/Intel
    Function Time(s)  Percent of Run Time  Hit Count  Function
               0.410                 39.4          1  _old_loop
               0.249                 23.9          1  _new_loop
Statistical vs. Basic Block Profile
    void ijk_loop() {  // loops kji and ikj as well
      sum = 0;
      for (i=0;i<YNUM;i++)
        for (j=0;j<YNUM;j++)
          for (k=0;k<YNUM;k++)
            sum += y[i][j][k];
      printf("sum = %f\n",sum);
    }
Basic Block vs. Statistical Sampling
Basic Block:
         Percent    cycles      inst  calls  function
    [1]    25.3%  51141434  37101028      1  ijk_loop foo.c, 47
    [2]    25.3%  51141434  37101028      1  kji_loop foo.c, 57
    [3]    25.3%  51141434  37101028      1  ikj_loop foo.c, 66
Statistical Sampling:
         Percent  Samples  Procedure   Function
    [1]    38.0%     2700  kji_loop    foo.c, 57
    [2]    23.9%     1700  setup_data  foo.c, 15
    [3]    19.7%     1400  ikj_loop    foo.c, 66
    [4]    18.3%     1300  ijk_loop    foo.c, 47
Now We Know About Hot Spots...
What do we do next?
• Use compilers to fine-tune code
• Use knowledge of language to optimize
• Hand-tune code
Profiling is fun, hard, and iterative and it can be highly effective
Compiler and Language Issues
Keith Cok, SGI
Bob Kuehne, SGI
Compiler and Language Issues
Compiler Optimizations:
• Occur within a compromise of speed and memory space vs. time to compile and link
• An iterative process to discover what does and doesn’t work
• Important to keep at it
Compiler Issues: Trade-Offs
• Trade-offs:
– Round-off vs. needed precision
– Inter-procedural analysis vs. link time
– Pointer aliasing vs. coding constraints
– Optimizing for processor architectures vs. work of multiple binaries (support, test)
• Explore other compilers than your first choice
• Different source code - different flags
Compiler and Language Issues
Comments on 32 vs. 64 bit code
• Benefits of 64 bit code:
– Increased address space
– Higher precision
• Downsides of 64 bit code:
– Application memory footprint
– Need to port which can be difficult!
• Performance issues
Language Issues
• Data Management
• Unrolling loops
• Arrays
• Temporary variables
• Pointer aliasing
Language Issues: Data Management
Manipulate data structures efficiently since graphics IS data
    // large member sits between the links and the key
    struct str {
        struct str *next;
        struct str *prev;
        large_type foo;
        int key;
    };
    // links and key kept adjacent
    struct str {
        struct str *next;
        struct str *prev;
        int key;
        large_type foo;
    };
Language Issues: Data Management
Pack data efficiently
    typedef struct foo {
        char aa;   // 8 bits + 24 pad
        float bb;  // 32 bits
        char cc;   // 8 bits + 24 pad
        float dd;  // 32 bits
        char ee;   // 8 bits + 24 pad
    } foo_t;       // 160 bits
    typedef struct foo_better {
        float bb;  // 32 bits
        char aa;   // 8 bits
        char cc;   // 8 bits
        char ee;   // 8 bits + 8 pad
        float dd;  // 32 bits
    } foo_better_t;  // 96 bits
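The padding figures on the slide depend on the platform's alignment rules, so it is worth checking them with `sizeof` rather than trusting the comments. The sketch below measures the padding in both layouts; the helper names and the `FOO_PAYLOAD` macro are assumptions added for illustration.

```c
#include <stddef.h>

/* The two layouts from the slide. */
struct foo        { char aa; float bb; char cc; float dd; char ee; };
struct foo_better { float bb; char aa; char cc; char ee; float dd; };

/* Bytes of real data in either layout: three chars and two floats. */
#define FOO_PAYLOAD (3 * sizeof(char) + 2 * sizeof(float))

/* Padding bytes the compiler added to the original layout. */
size_t foo_padding(void)
{
    return sizeof(struct foo) - FOO_PAYLOAD;
}

/* Padding bytes the compiler added to the reordered layout. */
size_t foo_better_padding(void)
{
    return sizeof(struct foo_better) - FOO_PAYLOAD;
}
```

On a typical ABI with 4-byte aligned floats, the first layout carries 9 bytes of padding (20 bytes total, the slide's 160 bits) and the second carries 1 byte (12 bytes total, 96 bits).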
Language Issues: Data Management
Examine your arrays and note their caching behavior
• Break up large arrays into smaller sub-arrays for better memory access patterns
• Understand the implications of data layout and cache behavior
Language Issues: Loop Unrolling
Profiling Example
    // Code the old way
    19: void old_loop() {
    20:   sum = 0;
    21:   for (i = 0; i < NUM; i++)
    22:     sum += x[i];
    23:   printf("sum = %f\n", sum);
    24: }
    // Code the new way
    27: void new_loop() {
    28:   sum = 0;
    29:   ii = NUM%4;
    30:   for (i = 0; i < ii; i++)
    31:     sum += x[i];
    32:   for (i = ii; i < NUM; i += 4) {
    33:     sum += x[i];
    34:     sum += x[i+1];
    35:     sum += x[i+2];
    36:     sum += x[i+3];
    37:   }
    38:   printf(" sum = %f\n", sum);
    39: }
Language Issues: Loop Unrolling
Profile Example: Line Analysis
Line list, in descending order by time
    cycles  invocations  function
      4096         1024  old_loop  sum += x[i];
      2061         1024  old_loop  for (i = 0;i < NUM; i++)
       978          256  new_loop  sum += x[i+3];
       968          256  new_loop  sum += x[i+2];
       968          256  new_loop  sum += x[i+1];
       968          256  new_loop  sum += x[i];
       733          256  new_loop  for (i = ii; i < NUM; i +=4)
         7            1  new_loop  ii = NUM%4;
Language Issues: Loop Unrolling
Issues with loop unrolling:
• Code complexity
• Clutter
• Compiler may or may not do this
• Flags may affect compiler time spent optimizing
Only “thin” loops gain performance
Use application knowledge to take advantage of loop unrolling
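The unrolling pattern from the profiling example can be restated as self-contained functions. This is a minimal sketch: the names sum_rolled and sum_unrolled are illustrative, and the array and its length are passed as parameters instead of using the slide's globals.

```c
#include <stddef.h>

/* Straightforward loop: one add and one loop test per element. */
float sum_rolled(const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Unrolled by four: a short remainder loop handles the first n % 4
 * elements, then the main loop consumes four elements per iteration,
 * so the loop test runs a quarter as often. */
float sum_unrolled(const float *x, size_t n)
{
    float sum = 0.0f;
    size_t rem = n % 4;
    size_t i;

    for (i = 0; i < rem; i++)
        sum += x[i];
    for (i = rem; i < n; i += 4) {
        sum += x[i];
        sum += x[i + 1];
        sum += x[i + 2];
        sum += x[i + 3];
    }
    return sum;
}
```

Both functions add the elements in the same order, so they return the same value. Whether the unrolled form is faster depends on the compiler and flags, as the slide notes.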
Language Issues: Local temporary variables
Use local temporary variables to avoid repeatedly de-referencing a pointer structure
Example:
    x = global_ptr->record_str->a;
    y = global_ptr->record_str->b;
Use:
    tmp = global_ptr->record_str;
    x = tmp->a;
    y = tmp->b;
Language Issues: Using tmp vars for global vars within a function
// Using a local tmp: new_pt is written once per component
void tr_point(FLOAT *old_pt, FLOAT *m, FLOAT *new_pt)
{
    FLOAT *c1, *c2, *c3, *c4, *op, *np, tmp;
    int j;

    c1 = m; c2 = m+4; c3 = m+8; c4 = m+12;
    for (j = 0, np = new_pt; j < 4; j++) {
        op = old_pt;
        tmp  = *op++ * *c1++;
        tmp += *op++ * *c2++;
        tmp += *op++ * *c3++;
        *np++ = tmp + (*op * *c4++);
    }
}

// For comparison, the same loop without tmp: every partial sum goes through np
    for (j = 0, np = new_pt; j < 4; j++) {
        op = old_pt;
        *np  = *op++ * *c1++;
        *np += *op++ * *c2++;
        *np += *op++ * *c3++;
        *np++ += *op++ * *c4++;
    }
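A compact restatement of the tmp-variable version, as a sketch only. It assumes FLOAT is float (the slide does not show the typedef) and uses array indices in place of the walking pointers; the name tr_point_tmp is illustrative.

```c
/* Transform a 4-component point by a 4x4 matrix stored as 16 floats,
 * with the same indexing as the slide:
 *   new_pt[j] = sum over k of old_pt[k] * m[4*k + j]           */
void tr_point_tmp(const float *old_pt, const float *m, float *new_pt)
{
    for (int j = 0; j < 4; j++) {
        /* Accumulate in a local so the value can stay in a register;
         * new_pt[j] is stored exactly once. */
        float tmp = old_pt[0] * m[j];
        tmp += old_pt[1] * m[4 + j];
        tmp += old_pt[2] * m[8 + j];
        tmp += old_pt[3] * m[12 + j];
        new_pt[j] = tmp;
    }
}
```

Writing through new_pt only once per component also means the compiler need not worry that an intermediate store changed old_pt or m.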
Language Issues: Pointer Aliasing
• Pointers are aliases when they point to potentially overlapping regions of memory
• If regions never overlap, the compiler may optimize for that case; in general, though, this is not possible
• The compiler can't tell when pointers are aliased
• Use the restrict keyword or a compiler option
Language Issues: Pointer Aliasing
[Figure: "in" and "out" buffers drawn twice, once for unaliased pointers and once for aliased pointers]
With unaliased pointers, compilers may use:
- Parallelism
- Pipelining
Language Issues: Pointer Aliasing
void process_data(float * restrict in, float * restrict out, float gain) {
    int i;
    for (i = 0; i < NSAMPS; i++) {
        out[i] = in[i] * gain;
    }
}
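To see why the compiler has to be conservative without restrict, consider the same loop called with overlapping buffers. This is a sketch; scale is a hypothetical name, and the length is passed as a parameter instead of the slide's NSAMPS.

```c
#include <stddef.h>

/* No restrict: out[i] may be the same memory as in[i + 1], so each
 * store must be visible to the next load and the iterations cannot
 * be freely reordered or run in parallel. */
void scale(const float *in, float *out, float gain, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * gain;
}
```

Calling scale(buf, buf + 1, 2.0f, 3) on a buffer of ones makes each output feed the next input, giving 1, 2, 4, 8, whereas separate buffers give 2, 2, 2. Declaring both pointers restrict promises the compiler that the first case never happens.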
C++: General Issues
• Language features
– RTTI, safe casts, etc.
• Use const, mutable, volatile, & inline
– hints to compilers
• Object construction
– arrays, default constructors, arguments, etc.
• Method invocation issues
– operators, overloads, conversion, etc.
C++: Virtual Functions
• Good - used to invoke child method when managing base-class handles
• Expensive - incur an additional pointer de-reference
– one, find VTBL; two, find method; then invoke
– bad for caching
• Use when necessary, but not for common objects
– Good for ‘large’ methods that do lots of work
– Bad for ‘small’ methods, like a vertex query
C++: Exceptions & Templates
Exceptions
• Great for error checking
• Performance penalty
– Additional stack information required
Templates
• Great for code re-use
• Memory penalty
– Across libraries, across object files
Code & Language Issues: The End
Balance
• Know your compiler
– Features & performance
• Know your language
– Features & performance
• Know your app
– Features & performance
Idioms and Application Architectures
Alan Commike, SGI
Starting Quote
The best tuned most efficient bubble sort is still a bubble sort. Additional tweaking won't improve performance.
Change The Algorithm! - Commike ‘99
Introduction
To write an efficient graphics application, one must:
• Understand the platform
• Use graphics efficiently
• Write good code
Use efficient application structures and algorithms
Outline
• Outline
• Background
• Culling
• Level of Detail (LOD) management
• Application architectures
Application Architectures: Rendering Path
• Application work, culling, LOD, drawing
• Pipelined rendering path
[Figure: pipeline timing diagram. Rows Frame 0, Frame 1, Frame 2, each with stages App, Cull, LOD, Draw; time axis T0 to T5]
Application Architectures: Target Frame Rate
A target frame rate attempts to bound the maximum render time
• Control Culling and LOD aggressiveness
• Maintain a constant frame rate
• Achieve an acceptable interactive frame rate
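One simple way to control aggressiveness is a per-frame feedback step on a bias value that the culling and LOD stages consult. The sketch below is an assumption, not the course's method: update_lod_bias, the step size, the 0.9 slack factor, and the clamp range are all made up for illustration.

```c
/* Hypothetical controller: returns an updated bias given the last
 * frame's render time. A larger bias means coarser LODs and more
 * aggressive culling. */
float update_lod_bias(float bias, float frame_ms, float target_ms)
{
    const float step  = 0.1f;   /* assumed adjustment per frame */
    const float slack = 0.9f;   /* relax only when well under budget */

    if (frame_ms > target_ms)
        bias += step;           /* over budget: draw less detail */
    else if (frame_ms < slack * target_ms)
        bias -= step;           /* comfortably under: restore detail */

    if (bias < 0.0f) bias = 0.0f;   /* clamp to an assumed range */
    if (bias > 4.0f) bias = 4.0f;
    return bias;
}
```

The dead band between slack * target_ms and target_ms keeps the bias from oscillating when the frame time sits just under the target.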
Graphics Idioms
• Culling
– Removing geometry that isn't visible
• Level of Detail Management
– Reducing geometric complexity
Culling
Don’t draw what you can’t see
Culling: Culling Types
Use one. Use all. Pipeline them together.
• View Frustum Culling
• Backface Culling
• Contribution Culling
• Occlusion Culling
Culling: Bounding Volumes
Test against a bounding volume, not individual primitives
• Can be bounding sphere, box, oriented box, or any enclosing volume
• Hierarchical bounding volumes to reduce cull time
• Spheres are fast, boxes are more accurate
– Use a combination of both
Culling: View Frustum
Graphics pipeline clips data that falls outside the View Frustum
If it will be clipped, don’t bother drawing
Culling: View Frustum Usefulness
• Improves geometry rate
– Culled vertices are not transformed, lit, and clipped
• Improves host download rate
– Less data moved from memory into graphics
• Does not change fill rate
– Triangles outside the View Frustum would not have been drawn anyway
Culling: View Frustum Implementation
• Transform vertices to clip coordinates (in OpenGL, multiply by the Model-View and Projection matrices)
• Check each vertex against View Frustum
• Geometry is either In, Out, or Partial
• Render In and Partial
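The In / Out / Partial test can be sketched with per-vertex outcodes on clip-space coordinates, typically applied to the corners of a bounding box. This is an illustrative sketch: the names CullResult, outcode, and classify are assumptions, and the vertices are taken to be already multiplied by the Model-View and Projection matrices.

```c
#include <stddef.h>

typedef enum { CULL_IN, CULL_OUT, CULL_PARTIAL } CullResult;

/* Bit i is set when the clip-space vertex (x, y, z, w) lies outside
 * frustum plane i, using the OpenGL convention -w <= x, y, z <= w. */
static unsigned outcode(const float *v)
{
    unsigned code = 0;
    if (v[0] < -v[3]) code |= 1u;    /* left   */
    if (v[0] >  v[3]) code |= 2u;    /* right  */
    if (v[1] < -v[3]) code |= 4u;    /* bottom */
    if (v[1] >  v[3]) code |= 8u;    /* top    */
    if (v[2] < -v[3]) code |= 16u;   /* near   */
    if (v[2] >  v[3]) code |= 32u;   /* far    */
    return code;
}

/* verts holds count vertices, four floats each. */
CullResult classify(const float *verts, size_t count)
{
    unsigned any = 0, all = ~0u;
    for (size_t i = 0; i < count; i++) {
        unsigned c = outcode(verts + 4 * i);
        any |= c;
        all &= c;
    }
    if (any == 0) return CULL_IN;    /* no vertex outside any plane */
    if (all != 0) return CULL_OUT;   /* all vertices outside one common plane */
    return CULL_PARTIAL;             /* straddles the frustum: let the pipe clip */
}
```

Geometry is only rejected when every vertex is outside the same plane, so a volume that spans the whole frustum is reported as Partial and still drawn.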
Culling: Skip the Clip
In software transform systems (GTX-RD), skip the clip
• Partial and In geometry classified
– Pipe renders Partial as usual
– Pipe can render In without a View Frustum clip
• Might be a hint to render
• Can improve geometry rates if not already fill-limited
Culling: Backface
Only half of any closed polyhedron is visible at any one time
Don’t render what you can’t see
Culling: Backface Usefulness
• Improves fill rate when using a native implementation
– Primitives are transformed and lit before culling
• Helps both geometry and fill with an application specific algorithm
– More computationally expensive
– Balance graphics and CPU work
• This may not work well when you can enter closed geometry or need two-sided lighting
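An application-side backface test usually reduces to the sign of a dot product between the polygon normal and the eye-to-polygon vector. A minimal sketch, assuming outward-facing normals and that all inputs are in the same coordinate space; is_backfacing is a hypothetical name.

```c
/* Returns 1 when a polygon with outward normal n, containing point p,
 * faces away from an eye located at eye. */
int is_backfacing(const float n[3], const float p[3], const float eye[3])
{
    /* Vector from the eye to the polygon. */
    float vx = p[0] - eye[0];
    float vy = p[1] - eye[1];
    float vz = p[2] - eye[2];

    /* Non-negative dot product: the normal points away from the eye. */
    return (n[0] * vx + n[1] * vy + n[2] * vz) >= 0.0f;
}
```

Running this on the CPU saves transform and lighting work for rejected polygons, at the cost of one dot product per polygon, which is the graphics/CPU balance the slide refers to.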
Lava. Hot!
Random Quote
Try not. Do, or do not. There is no try.
- Yoda ‘80
Culling: Contribution
If it’s too small to make a difference
don’t render it
Culling: Contribution Usefulness
• Improves geometry rate
– Culled vertices are not transformed, lit, and clipped
• Improves host download rate
– Less data moved from memory into graphics
• Does not change fill rate
– Screen space projection already minimal
– Removes few pixels from rasterization stage
Culling: Contribution Implementation
Don’t render items that fall below a size threshold
• Screen space size of bounding volume
• A less computationally expensive approach
– Distance to object combined with some notion of global object size
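The cheaper approach can be sketched as a ratio test between a bounding-sphere radius and the distance from the eye. The function name is_too_small and the min_ratio tuning parameter are assumptions for illustration, not part of the course material.

```c
/* Returns 1 when an object should be skipped. radius is the object's
 * bounding-sphere radius, distance the eye-to-center distance, and
 * min_ratio an assumed threshold on radius / distance, which serves
 * as a proxy for projected screen size. */
int is_too_small(float radius, float distance, float min_ratio)
{
    if (distance <= radius)
        return 0;   /* eye is inside or touching the sphere: always keep */

    /* Same as radius / distance < min_ratio, without the divide. */
    return radius < min_ratio * distance;
}
```

This avoids projecting the bounding volume to screen space, at the price of ignoring field of view and viewport size unless they are folded into min_ratio.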
Culling: Occlusion
If you can’t see it
don’t draw it
[Figure labels: Front, Side]
Culling: Occlusion Goals
Find the optimal set of occluders that will enable drawing the minimal number of occludees
• Occluders: The geometry that is visible
• Occludees: The geometry that is not visible
• Use general purpose occlusion culling algorithms
• Use application specific spatial knowledge if possible
Culling: Occlusion Culling Usefulness
• Can improve both transform-limited and fill-limited applications
• Computationally expensive
– Beware of time trade-offs
• Possible hardware support
Culling: General Occlusion Culling
• Used for arbitrary scenes
• Can improve both transform-limited and fill-limited applications
• Computationally expensive for arbitrary scenes
Culling: Occlusion Spatial Partitioning
“Cell and Portal” Culling
• Spatial organization leads to Cells and Portals
• Games that move from room to room
• Architectural walkthroughs
LOD: Overview
After culling, need to draw what is left
• Still too much geometry:
– Use multiple Levels of Detail, i.e. multi-resolution objects
• Match geometric complexity to visible on-screen space coverage
• Reduce geometric complexity to maintain target frame rate
LOD: Issues
• Generating LODs:
– Height Fields vs 3D objects
– View-Dependent: nice, but compute intensive
– View-Independent: fast, memory intensive
• Need to decide which LOD level to use
– Not trivial!
• Need smooth transitions between levels
– Geomorphs
LOD: Height Fields
• Generally thought of as infinite terrain
• Specialized algorithms can be used
LOD: 3D Models
• General purpose simplification algorithm
• Can use on height fields also
• Some recent real-time view-dependent algorithms
• Also used for compression
[Figure: a model at four levels of detail: 1024, 256, 64, and 16 triangles]
LOD: When to switch LOD levels
The ability to generate LOD models is not sufficient on its own
• Need to know when to use which LOD level
– Single constant hard metric: distance from eye
– Multiple heuristics: cost, benefit, rankings
• Can bias LODs to ensure frame rate targets are reached
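The single-metric case (distance from eye, with a bias) can be sketched as a lookup against a table of switch distances. The function select_lod, the table layout, and the multiplicative bias are assumptions for illustration.

```c
#include <stddef.h>

/* switch_dist[i] is the eye distance at which level i gives way to
 * level i + 1, in ascending order; level 0 is the most detailed.
 * A bias above 1 switches to coarser levels sooner, which is one way
 * to pull the frame time back toward its target. */
size_t select_lod(float distance, const float *switch_dist,
                  size_t num_switch, float bias)
{
    size_t level = 0;
    while (level < num_switch && distance * bias >= switch_dist[level])
        level++;
    return level;   /* in the range [0, num_switch] */
}
```

With switch distances 10, 50, and 200, an object at distance 30 gets level 1 at bias 1.0 and level 2 at bias 2.0.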
LOD: Level determination
• Determine system rendering characteristics
• Determine cost of rendering each object
• Render objects with highest benefit while remaining under the target frame rate
Level determination can be time consuming!
“take the time to time the time taken to reduce the rendering time”
Going, and going, and going...
LOD: Determining cost of rendering
Cost is affected by many factors
• Graphics hardware: published benchmarks, startup tests
• Number of vertices: primarily a function of LOD algorithm
• Rendering Quality: lighting, shading, wire frame, anti-aliasing, etc.
• Global Factors: total texture memory, dirty internal state
LOD: Benefit Function
Cost alone is not good enough, need benefit also
• Rendered size of object
• Error tolerance between LOD level and reference model
• Importance in scene
• Frame-to-frame coherency
LOD: The Optimal LODs
For all Objects, at each LOD Level, rendered with each RenderType
Maximize the Benefit function:
    Benefit(Object, Level, RenderType)
Subject to:
    Cost(Object, Level, RenderType) <= TargetFrameRate
LOD: Optimal Optimizations
• Simulated Annealing
• Monte Carlo Simulations
• Simplex Searches
Dude, can you spare a few dozen CPUs?
LOD: Trade-offs
There is not enough time to both solve the full LOD optimization problem and render the scene
• Simplify cost and benefit functions
• Simplify optimization problem into a ranking of Benefit/Cost
• Use frame-to-frame coherency
• Be sure to consider time taken to calculate LODs
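The Benefit/Cost ranking can be sketched as a greedy selection: sort candidates by benefit per unit cost, then take them in order while they fit the frame budget. This is a simplification under stated assumptions: each object contributes a single candidate (a full version would choose one level per object), and the Candidate struct, field meanings, and function names are hypothetical.

```c
#include <stdlib.h>

typedef struct {
    float benefit;   /* assumed: estimated contribution to the image */
    float cost;      /* assumed: estimated render time in ms, > 0 */
    int   selected;  /* set to 1 when the candidate fits the budget */
} Candidate;

/* qsort comparator: descending benefit/cost. Cross-multiplying avoids
 * the divide; it is valid because costs are positive. */
static int by_value_desc(const void *a, const void *b)
{
    const Candidate *x = a, *y = b;
    float lhs = x->benefit * y->cost;
    float rhs = y->benefit * x->cost;
    return (lhs < rhs) - (lhs > rhs);
}

/* Sorts c in place, marks the candidates that fit, and returns the
 * total cost of the selected set. */
float select_within_budget(Candidate *c, size_t n, float budget_ms)
{
    float used = 0.0f;
    qsort(c, n, sizeof *c, by_value_desc);
    for (size_t i = 0; i < n; i++) {
        c[i].selected = (used + c[i].cost <= budget_ms);
        if (c[i].selected)
            used += c[i].cost;
    }
    return used;
}
```

The greedy result is not guaranteed optimal, but it costs one sort per frame, and frame-to-frame coherency means the order changes little between frames.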
Application Architectures: Multi-Threading
• More stages give more time to cull or generate LODs
• Each stage adds latency
[Figure: pipeline timing diagram. Rows Frame 0, Frame 1, Frame 2, each with stages App, Cull, LOD, Draw; time axis T0 to T5]
Application Architectures: Multi-Threading
• Hard part is data synchronization
• Watch out for memory bloat
Application Architectures: Scene Graphs
A scene graph is the basic data structure holding the description of your scene
• Cull-able, sort-able, and can contain multi-resolution objects
• Hierarchical Bounding Volumes
• Statistics gathering and timing infrastructure
• For large scenes can do memory management and database paging
Application Architectures: Trade-offs
• Quality
• Speed
• Memory
• Complexity
Conclusion
Most importantly - Think about balance!
Performance Hints
Keith Cok, SGI
Performance Hints: Pipeline Management
• Avoid round trips to graphics server
– Cache own state/attribute information
– Avoid pipeline queries (e.g., glGet*)
– Flush buffer efficiently (glFlush vs. glFinish)
• Reduce state changes. Sort by expense. For example, sort geometry by type (triangles, quads, etc.) and then by color
• Eliminate unused attributes
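The sort-by-expense advice can be sketched without any graphics calls: order the draw list by primitive type, then by color, and count how often the state would have to change. The DrawItem struct and function names are hypothetical, and treating a type change as more expensive than a color change is an assumption taken from the slide's example ordering.

```c
#include <stdlib.h>

typedef struct {
    int      prim_type;   /* hypothetical id: 0 = triangles, 1 = quads, ... */
    unsigned color;       /* packed RGBA */
} DrawItem;

/* Order by primitive type first, then by color within a type. */
static int by_state(const void *a, const void *b)
{
    const DrawItem *x = a, *y = b;
    if (x->prim_type != y->prim_type)
        return (x->prim_type > y->prim_type) - (x->prim_type < y->prim_type);
    return (x->color > y->color) - (x->color < y->color);
}

void sort_by_state(DrawItem *items, size_t n)
{
    qsort(items, n, sizeof *items, by_state);
}

/* Counts how many times type or color differs from the previous item,
 * i.e. how many state changes a naive renderer would issue. */
size_t count_state_changes(const DrawItem *items, size_t n)
{
    size_t changes = 0;
    for (size_t i = 1; i < n; i++) {
        changes += items[i].prim_type != items[i - 1].prim_type;
        changes += items[i].color != items[i - 1].color;
    }
    return changes;
}
```

For a list that alternates triangles and quads of one color, sorting drops the count from one change per item to a single change for the whole list.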
Performance Hints: Debugging
Detect graphics errors:
#ifdef DEBUG
#define GLEND() glEnd();\
    {int err; \
     err = glGetError(); \
     if (err != GL_NO_ERROR) \
         printf("%s\n",gluErrorString(err)); \
     assert(err == GL_NO_ERROR);}
#else
#define GLEND() glEnd()
#endif
Performance Hints: Geometry
• Maximize data between glBegin/glEnd
– Sort geometry by type (triangle, quad, etc.) and group them together
– Find best fit for length of glBegin/glEnd pair
• Use strip primitives (GL_TRIANGLE_STRIP...) to reduce geometry data sent to the pipeline
• Avoid GL_POLYGON. Use specific geometric primitives instead (GL_TRIANGLES, GL_QUADS, etc.)
• Use GL_FASTEST with glHint calls where possible
Performance Hints: Geometry
• Use flat display lists for static geometry. Deep display lists may induce unwanted memory thrashing
• Use API matrix operations instead of your own
• Use texture to simulate complex geometry
• Use vertex arrays. Test vertex, interleaved, precompiled arrays
Performance Hints: Geometry
• Pass one normal (not 3 or 4) per flat shaded polygon
• Use a data format suitable for quick transfer to the graphics subsystem
• Disable unneeded operations (alpha blending, depth, stencil, blending, dithering, fog, etc.)
Performance Hints: Lighting
• Reduce lighting requirements:
– Use as few lights as possible
– Use directional (infinite) lighting. Use glLightfv(GL_LIGHTn, GL_POSITION, {x,y,z,0});
– Use positional lights rather than spot lights
– Use one-sided lighting when possible (be aware of issues associated with normals)
– Don’t change material properties frequently
Performance Hints: Lighting
• Use normalized normal vectors
– Supply unit length vectors
– Don’t enable GL_NORMALIZE
– Don’t scale using model-view matrix
• Pre-multiply geometry, if possible
Performance Hints: Visuals/Pixel Formats
• Pick the correct visual. Use hardware accelerated visuals
• Structure windows and contexts to maximize performance (app may block after context swaps)
• Put GUI elements in overlay planes to avoid unwanted graphics window refreshes
Performance Hints: Buffers
• Turn off depth buffer when possible
• Use HW accelerated off-screen buffer for backing-store
• Use stencil buffer for interactive picking and quick re-render (see course notes for full algorithm)
• Use color/depth buffer data for interactive editing of complex scenes (see course notes for full algorithm)
Performance Hints: Textures
• Be aware of texture sizes
– Reduce texture resolution
– Use texture LOD extension (OpenGL 1.2)
• Use texture objects. Create textures once
• Don’t swap textures frequently, if possible
– Mosaic multiple textures into one large texture
– Sort geometry by texture
Performance Hints: Textures
• Use texture as an additional data lookup to simulate more complex data:
– Lighting, geometry, color, clipping, application-space data
• Use glTexSubImage to replace part of a texture rather than creating a whole new texture
• Avoid expensive texture filter modes
• Use texture lookup tables instead of multi-channel textures
Conclusion
Know how your application works within the system
• Don’t let caches, latencies, bandwidths, etc. slow you down
• Know how fast you can go
• Identify system performance characteristics
• Work your compiler
• Get all you can out of the hardware
Questions and Answers