3/23/2009
Game Developers Conference 2009
Programming Tips For Scalable Graphics Performance
March 25, 2009, Room 2010
Luis Gimenez, Graphics Architect
Ganesh Kumar, Application Engineer
Katen Shah, Graphics Architect
Agenda
• Why Optimize for Scalable Graphics
• Intel® GMA Series Architecture and Tools
• Balance Work Load Between CPU and GPU
• Minimize Runtime and Driver Overhead
• Optimize Shader Performance
• Case Study
• Q&A
Developing for Integrated Graphics Allows You to Sell Your Game to More Customers!
[Chart: PC Graphics Market Segment, in millions (axis 0 to 300), 2007 to 2013. Series: Desktop Integrated, Desktop Discrete, Mobile Integrated, Mobile Discrete.]
Source: Mercury Research (Q4’08)
Scale Your Game!
Intel® Integrated Graphics (IIG) Architecture
[Diagram: Cmd Streamer; fixed-function stages VF, VS, GS, Clip, SO, Setup, Rast / Early-Z, Pixel Ops; Thread Dispatch feeding an Array of Execution Units (Row 0 to Row N, EU0 to EUn) with I$ Cache; Sampler with Texture Cache; Render Cache; Video Processing; 2D Display; Memory / Cache; commands and memory traffic carried over internal buses.]
Intel® GMA 3 & GMA 4 Series support SM4
Intel’s New Graphics Performance Analyzers
Today 2:30 PM – 3:30 PM in Room 3004, West Hall
FRAME ANALYZER
SYSTEM ANALYZER
Optimization Hints For Intel® Integrated Graphics
How to avoid frequent pitfalls found in testing integrated graphics playability over numerous games every year
• Balance Workload Between CPU and GPU
• Minimize Runtime and Driver Overhead
• Optimize Shader Performance
Balance The Workload between the CPU and the GPU
Work suited to the GPU:
• Massive Data Parallelism
• Per Pixel Lighting
• Shadows
• Post Processing
• Blending
Work suited to the CPU:
• Complex Algorithms
• Physics/AI
• Simulation
• Animation
• Pre-computing
OCEAN FOG DEMO
Pre-computing the Perlin textures on the CPU and using the GPU for rendering nearly doubled the frame rate
http://software.intel.com/en-us/articles/ocean-fog-using-direct3d-10/
Maximize CPU and GPU Utilization:
Avoid Stalling the Pipeline!
To avoid stalling the CPU minimize …
• CPU data read-back
• Serializing Event Queries
[Diagram: GPU output read back through a staging resource. 1. CopyResource: a command in the CMD buffer copies the render output to the staging resource. 2. The CPU calls Map() on the resource. 3. The CPU stalls until flush. The CPU frame timeline (F0 to F5) shows STUTTERING.]
Maximize CPU and GPU Utilization:
Avoid Stalling the Pipeline!
To avoid stalling the CPU minimize …
• CPU data read-back
• Serializing Event Queries
Put Space between locks…
• Synchronize to N-1 to N-2 frames
[Diagram: CPU and GPU frame timelines (F0, F1, F2, ...). Synchronizing on the current frame produces a STALL on each frame; with N-2 SYNCH the CPU works on frame F3 while the GPU finishes earlier frames.]
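The N-2 synchronization above can be sketched as a small ring that hands back whatever was issued a fixed number of frames ago, so the CPU only touches results the GPU has most likely finished. This is a minimal illustration, not Direct3D code: `DelayedReadback` and its `push` method are hypothetical names, and the stored item stands in for a query or staging-resource handle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the item issued `latencyFrames` frames ago,
// so the CPU never waits on work the GPU has only just received.
template <typename T>
class DelayedReadback {
public:
    explicit DelayedReadback(std::size_t latencyFrames) : ring_(latencyFrames) {
        assert(latencyFrames > 0 && "latency must be at least one frame");
    }

    // Call once per frame with the item issued this frame (for example a
    // query or staging-resource handle). Returns true and fills `ready`
    // once an item that is `latencyFrames` old exists.
    bool push(const T& issued, T& ready) {
        const std::size_t slot = frame_ % ring_.size();
        const bool haveOld = frame_ >= ring_.size();
        if (haveOld) ready = ring_[slot];  // oldest entry lives in this slot
        ring_[slot] = issued;              // reuse the slot for this frame
        ++frame_;
        return haveOld;
    }

private:
    std::vector<T> ring_;
    std::size_t frame_ = 0;
};
```

With a latency of 2, the first two frames return nothing and frame N returns the item issued in frame N-2. The cost is that the application acts on data that is two frames old.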
Maximize CPU and GPU Utilization:
Avoid Stalling the Pipeline!
To avoid stalling the CPU minimize …
• CPU data read-back
• Serializing Event Queries
Put Space between locks…
• Synchronize to N-1 to N-2 frames
• The IIG driver optimizes the workload before sending it to the GPU
Reduce CPU work, optimize Driver performance by reducing…
• State Changes
• Creation and Destruction of Resources
[Diagram: App, Direct3D and Intel Driver write Commands to Memory. The GPU pipeline reads them: Cmd Parser, Vertex Shader, Geometry Shader, Stream Out, Clipper, Setup / Rasterization, Pixel Shader, Output Merger. Memory resources: Vertex Buffers, Index Buffer, Textures, Buffer, Depth / Color, Display Buffer.]
Optimization Hints For Intel® Integrated Graphics
• Balance load Between CPU and GPU
• Minimize Runtime and Driver Overhead
• Optimize Shader Performance
Minimizing Runtime and Driver Overhead
Manage Your DirectX 10 Resources!
• DirectX 10 manages resources based on USAGE and CPU_ACCESS_FLAG
• The best memory location is decided by OS/driver/memory manager
DX10 Usage (update frequency) | Access | Resource update | Use
NON MAPPABLE
IMMUTABLE (never) | GPU read | Create…(): load at creation, never updated | Static VBs/IBs/Textures
DEFAULT (<=1 per frame) | GPU read-write | Copy…(), Update…(): use only for CBs and small textures | VBs/IBs/CBs/Textures
MAPPABLE
DYNAMIC (>1 per frame) | CPU write, GPU read | Copy(); Map() w. WRITE_NO_OVERWRITE for partial update of VBs/IBs; WRITE_DISCARD for full update or CBs | Dynamic update of VBs/IBs/CBs
STAGING (transfer data to the GPU) | CPU read-write, GPU indirect read/write | Map() for write to mapped memory, with WRITE/DO_NOT_WAIT_FLAG to avoid stalls; Copy…() from staging resource to video memory | Texture updates
STAGING (CPU read-back from GPU) | CPU read-write, GPU indirect read/write | Copy() GPU output to staging resource; Map() for read w. DO_NOT_WAIT_FLAG to avoid stall | Surfaces for read-back
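The usage table above boils down to a decision on how often the CPU updates a resource and whether it reads GPU output back. The sketch below encodes that decision. It is only an illustration: the `Usage` enum stands in for the Direct3D 10 usage values, and `AccessPattern` and `chooseUsage` are hypothetical names that are not part of any API.

```cpp
#include <cassert>

// Stand-in for the four Direct3D 10 usage categories in the table.
enum class Usage { Immutable, Default, Dynamic, Staging };

// Hypothetical description of how the application touches a resource.
struct AccessPattern {
    double cpuUpdatesPerFrame;  // 0 means loaded at creation, never updated
    bool   cpuReadsBack;        // the CPU needs to read GPU output
};

// Follows the table: read-back goes through STAGING, data that never
// changes is IMMUTABLE, at most one update per frame is DEFAULT, and
// more frequent CPU writes are DYNAMIC.
Usage chooseUsage(const AccessPattern& p) {
    assert(p.cpuUpdatesPerFrame >= 0.0);
    if (p.cpuReadsBack)              return Usage::Staging;
    if (p.cpuUpdatesPerFrame == 0.0) return Usage::Immutable;
    if (p.cpuUpdatesPerFrame <= 1.0) return Usage::Default;
    return Usage::Dynamic;
}
```

A real engine would also weigh resource size and type (the table limits Update…() to constant buffers and small textures), which this sketch leaves out.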
Minimizing Runtime and Driver Overhead
Optimize Your Constants Access!
• IIG Driver optimizes for DX9/10 the most frequently used constants
– Avoid global constants
– Limit dynamically indexed constants: C[a0], C[r]
• In DX10 when a constant changes the complete buffer gets updated
– Group cbuffers by frequency of updates
– Organize cbuffers based on feature scaling
– Inside cbuffer put constants by access sequence
– Inside cbuffers pack data into float4 boundaries
[Image: Fog Demo]
http://software.intel.com/en-us/articles/directx-constants-optimizations-for-intel-integrated-graphics/
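To see why packing into float4 boundaries matters, the sketch below computes byte offsets for a list of scalar and vector constants, assuming the HLSL rule that such a value may not straddle a 16-byte register. `packOffsets` is a hypothetical helper written for this illustration; it does not handle arrays, matrices or structs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Byte offsets for constants laid out in declaration order. Each size is
// 4, 8, 12 or 16 bytes (float to float4). A value that would cross a
// 16-byte register boundary is pushed to the start of the next register.
std::vector<std::size_t> packOffsets(const std::vector<std::size_t>& sizes,
                                     std::size_t* totalBytes = nullptr) {
    constexpr std::size_t kRegister = 16;
    std::vector<std::size_t> offsets;
    std::size_t cursor = 0;
    for (std::size_t size : sizes) {
        assert(size > 0 && size <= kRegister && size % 4 == 0);
        const std::size_t used = cursor % kRegister;
        if (used + size > kRegister) cursor += kRegister - used;  // padding
        offsets.push_back(cursor);
        cursor += size;
    }
    // The buffer itself is rounded up to a whole number of registers.
    if (totalBytes) *totalBytes = (cursor + kRegister - 1) / kRegister * kRegister;
    return offsets;
}
```

Declaring float3, float3, float, float (sizes 12, 12, 4, 4) occupies three registers under this rule, while float3, float, float3, float (12, 4, 12, 4) fits in two, so reordering alone shrinks the buffer that gets re-sent on every change.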
Minimizing Runtime and Driver Overhead
Batch Your Primitives!
• Use large batches (more than 200 to 1K primitives)
• Minimize State Changes between batches
• Use Instancing for Small Batches
http://software.intel.com/en-us/articles/rendering-grass-with-instancing-in-directx-10/
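One common way to act on the batching advice above is to sort draw calls by a state key and merge neighbours that share it. The sketch below does this on plain data. `DrawCall`, `stateKey`, `countStateChanges` and `batchByState` are hypothetical names, and merging assumes draws with the same key can share vertex and index buffers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawCall {
    std::uint32_t stateKey;    // hypothetical id for shader + textures + state
    std::uint32_t primitives;  // primitive count of this draw
};

// Number of times the state differs from the previous draw in the list.
std::size_t countStateChanges(const std::vector<DrawCall>& draws) {
    std::size_t changes = 0;
    for (std::size_t i = 1; i < draws.size(); ++i)
        if (draws[i].stateKey != draws[i - 1].stateKey) ++changes;
    return changes;
}

// Sort by state, then merge neighbours with the same state into one batch.
std::vector<DrawCall> batchByState(std::vector<DrawCall> draws) {
    std::stable_sort(draws.begin(), draws.end(),
                     [](const DrawCall& a, const DrawCall& b) {
                         return a.stateKey < b.stateKey;
                     });
    std::vector<DrawCall> batches;
    for (const DrawCall& d : draws) {
        if (!batches.empty() && batches.back().stateKey == d.stateKey)
            batches.back().primitives += d.primitives;
        else
            batches.push_back(d);
    }
    // After merging, every neighbouring pair differs in state.
    assert(batches.empty() || countStateChanges(batches) == batches.size() - 1);
    return batches;
}
```

Sorting by state only applies to draws whose order does not affect the image; blended geometry usually still needs depth ordering.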
Optimization Hints For Intel® Integrated Graphics
• Balance load Between CPU and GPU
• Minimize Runtime and Driver Overhead
• Optimize Shader Performance
Optimizing Shader Performance
Skip Computes that do not Render!
• Test for visibility to reject objects that fall outside the view frustum
• Maximize Use of Early-Z (cost 4 pixels/clock hardware)
• Avoid modified Z value (oDepth) in the pixel shader
• Use Occlusion Query for complex scenes
• Use LOD to reduce complexity for objects that are distant
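The visibility and LOD bullets above can be illustrated with two small CPU-side helpers: a conservative sphere-against-frustum test and a distance-based LOD pick. This is a sketch under stated assumptions: all type and function names are hypothetical, frustum planes are normalized and point inward, and LOD thresholds are ascending distances.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3   { float x, y, z; };
struct Plane  { Vec3 n; float d; };  // inside when dot(n, p) + d >= 0
struct Sphere { Vec3 center; float radius; };

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Conservative test: returns false only when the bounding sphere lies
// completely behind at least one frustum plane.
bool sphereInFrustum(const Sphere& s, const std::array<Plane, 6>& frustum) {
    for (const Plane& p : frustum)
        if (dot(p.n, s.center) + p.d < -s.radius) return false;
    return true;
}

// Picks a level of detail from the distance to the camera. Level 0 is the
// most detailed; thresholds[i] is where level i + 1 starts.
std::size_t selectLod(float distance, const std::vector<float>& thresholds) {
    assert(std::is_sorted(thresholds.begin(), thresholds.end()));
    std::size_t lod = 0;
    while (lod < thresholds.size() && distance >= thresholds[lod]) ++lod;
    return lod;
}
```

Objects that fail `sphereInFrustum` are skipped before any draw call is issued; the rest are drawn at the level returned by `selectLod`.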
Optimizing Shader Performance
Optimize the Use of the Intel Integrated Graphics HW!
• For best EU utilization, minimize register usage
• Keep a ratio of more than 4:1 instructions per texture sample
• Large shaders impact performance due to the limited number of registers
• Smart Usage of Flow Control
• Mask alpha when not needed
• Minimize use of transcendentals like LOG, POW, EXP etc.
• Pre-load Shaders to avoid Mid-Scene Compiles
• Avoid Mid-Scene texture changes
[Diagram: Intel Integrated Graphics architecture: Cmd Streamer, VF, VS, GS, Clip, SO, Setup, Rast / Early-Z, Pixel Ops, Thread Dispatch, Array of Execution Units (Row 0 to Row N, EU0 to EUn), I$ Cache, Sampler, Texture Cache, Render Cache.]
Optimizing Shader Performance
Scale Your Pixel Shader and Textures!
• Keep your Textures under 256x256 and same format if possible
• Prefer Multi-texture over Multi-Pass
• Use Compressed Textures and mip-maps
• Use Texture arrays / Texture Atlas
• Minimize Lock/Blit of Z and/or Stencil Buffer
• Use Shadow Maps for IIG and Stencil Shadows as scalable feature
• Minimize Clear() of surfaces
• Minimize post processing passes
Optimizing for IIG: Demigod
Key Lessons Learned from Optimizing Demigod for IIG
Be Wary of ‘Clear’ Calls
Why:
- Costlier than you might think
- Affects every pixel on surface
Recommendations:
- Make sure unused surfaces don’t get cleared unnecessarily
- Consider reducing surface resolution when in lower LOD
- Clear Color, Stencil and Z-Buffer in the same API call
Prune Costly Clear Calls
Reduce the Number of Texture Fetches
• Texture cache is limited on integrated graphics
• Reducing Texture sizes alone doesn’t help as much
• Optimize Shaders by reducing texture fetches in Low Fidelity modes
• Balance Texture load instructions with arithmetic instructions if possible
Simplify Post Processing Effects
• Post Processing Effects that use multiple passes
- Bloom
- Motion Blur
- Depth of Field
- High Dynamic Range
• Balance visual quality with speed by reducing the number of passes
Demigod Bloom Effect
[Images labeled Before and After; captions: Bloom turned Off, Bloom On with Fewer Passes]
Avoid Pixel Overdraw
• Render opaque objects from Front to Back
- Render UI and other HUDs first
- Render Sky and Terrain last
• Early-Z architecture eliminates occluded pixels early in the pipeline
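Front-to-back ordering of opaque objects can be sketched as a partition followed by a depth sort. The `Renderable` fields and `sortFrontToBack` are hypothetical names for this illustration, and `viewDepth` is assumed to be the distance from the camera along the view direction (0 for HUD elements, very large for the sky).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Renderable {
    int   id;         // hypothetical object handle
    float viewDepth;  // distance from the camera along the view direction
    bool  opaque;     // blended objects are not reordered by this sketch
};

// Moves opaque objects to the front of the list and sorts them nearest
// first, so Early-Z can reject pixels hidden behind closer geometry.
// Non-opaque objects keep their original relative order at the end.
void sortFrontToBack(std::vector<Renderable>& items) {
    const auto byDepth = [](const Renderable& a, const Renderable& b) {
        return a.viewDepth < b.viewDepth;
    };
    const auto firstBlended = std::stable_partition(
        items.begin(), items.end(),
        [](const Renderable& r) { return r.opaque; });
    std::sort(items.begin(), firstBlended, byDepth);
    assert(std::is_sorted(items.begin(), firstBlended, byDepth));
}
```

With this ordering the HUD (depth 0) is drawn first and the sky and terrain, which cover the most pixels at the greatest depth, are drawn last among the opaque objects.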
Example of Back to Front Rendering
Moving Terrain Rendering to the End
Lastly, Add Benchmark Mode to Your Game for Performance Profiling!
It helps to characterize the workload
Four key requirements a benchmark must provide:
1. Accurately reflect real workload
2. Repeatability
3. Ability to run standalone without Internet
4. Ability to Automate: built-in demo, command-line execution and output to a log file
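For the log-file requirement, a benchmark mode typically reduces the recorded frame times to a few numbers and writes them as one line per run. The sketch below shows one possible shape. `FrameStats`, `summarize`, `formatLogLine` and the log format are assumptions made for illustration; they are not taken from the presentation.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <sstream>
#include <string>
#include <vector>

struct FrameStats {
    double avgMs;   // mean frame time in milliseconds
    double minMs;   // fastest frame
    double maxMs;   // slowest frame
    double avgFps;  // frames divided by total time
};

// Reduces per-frame times (in milliseconds) from one benchmark run.
FrameStats summarize(const std::vector<double>& frameMs) {
    assert(!frameMs.empty());
    const double total = std::accumulate(frameMs.begin(), frameMs.end(), 0.0);
    const auto range = std::minmax_element(frameMs.begin(), frameMs.end());
    FrameStats s;
    s.avgMs  = total / static_cast<double>(frameMs.size());
    s.minMs  = *range.first;
    s.maxMs  = *range.second;
    s.avgFps = 1000.0 * static_cast<double>(frameMs.size()) / total;
    return s;
}

// Formats one log line per run so scripted runs are easy to compare.
std::string formatLogLine(const std::string& runName, const FrameStats& s) {
    std::ostringstream out;
    out << runName << " avg_ms=" << s.avgMs << " min_ms=" << s.minMs
        << " max_ms=" << s.maxMs << " avg_fps=" << s.avgFps;
    return out.str();
}
```

Running the same built-in demo with a fixed seed and appending one such line per run gives the repeatable, scriptable output the requirements list asks for.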
Summary
Scale Your Game for Integrated!
• Balance CPU and GPU Workload, Avoid Stalls
• Minimize Run Time and Driver Overhead
• Optimize your shader performance by scaling your game
• Analyze your game, find your most expensive call
• Balance your visual effects against performance penalties
• Add benchmark mode to your game
Additional Resources
Developers Guide for Intel® Integrated Graphics
• http://software.intel.com/en-us/articles/intel-graphics-media-accelerator-developers-guide
Articles Mentioned in this Presentation
• http://software.intel.com/en-us/articles/ocean-fog-using-direct3d-10
• http://software.intel.com/en-us/articles/directx-constants-optimizations-for-intel-integrated-graphics/
• http://software.intel.com/en-us/articles/rendering-grass-with-instancing-in-directx-10
Intel® Graphics Performance Analyzer
• www.intel.com/software/gpa
Intel® Graphics Community
• http://softwarecommunities.intel.com/communities/visualcomputing
Integrated Graphics Software Development Forum
• http://softwarecommunities.intel.com/isn/Community/en-US/forums/2414/ShowForum.aspx
Intel® Laptop Gaming TDK
• http://softwarecommunities.intel.com/articles/eng/1017.htm
The gateway to Intel’s worldwide technology, engineering and go-to-market support for Visual Computing developers
Training the Next Generation
Enhance Your Products and Your Business
Get the “Story Behind the Story”
Investing in Talent and Technology
See What’s New
Developers Connecting with Intel Engineers
www.intel.com/software/visualadrenaline
For More Information
http://www.intel.com/software/gdc
Contact info
See Intel at GDC:
- Intel Booth at Expo, North Hall
- Intel Interactive Lounge, West Hall 3rd floor
Take a collateral DVD
- Here in the room!
- Intel Booth or Interactive Lounge
Intel @ GDC
Wednesday, March 25
Programming Tips for Scalable Graphics
• 10:30 AM – 11:30 AM in Room 2010, West Hall
Threaded AI For the Win!
• 12:00 PM – 1:00 PM in Room 2011, West Hall
Intel’s New Graphics Performance Analyzers
• 2:30 PM – 3:30 PM in Room 3004, West Hall
Kaboom: Real-Time Multi-Threaded Fluid Simulation for Games
• 4:00 PM – 5:00 PM in Room 2011, West Hall
Thursday, March 26
Who Moved the Goalposts? The Rapidly Changing World of CPU’s and Optimization
• 1:30 PM – 2:30 PM in Room 2011, West Hall
Taming Your Game Production Demons: the Offset approach
• 3:00 PM – 4:00 PM in Room 2011, West Hall
Optimizing Game Architectures with Intel Threading Building Blocks
• 4:30 PM – 5:30 PM in Room 2011, West Hall
Last of Intel @ GDC
Friday, March 27
Procedural and Multi-Core Techniques to take Visuals to the Next Level
• 9:00 AM – 10:00 AM in Room 2010, West Hall
Rasterization on Larrabee: A First Look at the Larrabee New Instructions (LRBni) in Action
• 9:00 AM – 10:00 AM in Room 135, North Hall
SIMD Programming on Larrabee: A Second Look at the Larrabee New Instructions (LRBni) in Action
• 10:30 AM – 11:30 AM in Room 3002, West Hall
Risk Factors
This presentation contains forward-looking statements. All statements made that are not historical facts are subject to a number of risks and uncertainties, and actual results may differ materially. Please refer to our most recent Earnings Release and our most recent Form 10-Q or 10-K filing available on our website for more information on the risk factors that could cause actual results to differ.
Rev. 4/17/07
Backup Slides
Make your Scaling API Independent!
Both Intel GMA 3 and 4 support DirectX 10
[Chart: Game Scaling. APIs: DX8, DX9, DX10. Detail levels: High Detail, Standard Detail, Low Detail. Label: Recommendation.]
Both Intel GMA 3 and 4 support all required D3D10 Features
• D3D10 Optional Features
- MSAA: only single sample supported
- 32-bit FP Filtering: not supported
- 16-bit UNORM Blending: supported in GMA X4XXX and beyond
- RGB32 RT: not supported
- Use ID3D10Device::CheckFormatSupport to check for supported formats
• Other D3D10 performance considerations
– Limit use of the GS; make it a scalable feature
– Use different Stream Out buffers for different SO formats
Check for Optional Features before Using them