Shader Performance Analysis on a Modern GPU Architecture
Victor Moya, Carlos González, Jordi Roca, Agustín Fernández
Department of Computer Architecture, UPC
Roger Espasa
Intel DEG, Barcelona

Introduction
- Shaders in GPUs evolving towards general programming
  - Branches, generic loads, scatter
  - New types of shaders: geometry in DX10
- Current specialized shaders
  - Area hungry
  - Unbalancing leads to inefficiencies
- This paper: unify all shaders
  - ~8% higher performance with less area & resources

Outline
- Attila: our GPU architecture
- Attila-Classic: Non-unified shaders
- Attila-Unified: Unified Shaders
- Simulation Framework
- Results

ATTILA
- Our implementation of current GPUs
  - Inspired by both NVIDIA and ATI
  - Not exact to either pipeline
- Lack of detailed microarchitecture information
  - Educated guessing on our side
- Implemented Features
  - 2D Homogeneous Recursive Rasterization
  - Tiled Rasterization
  - Hierarchical Z
  - Texture compression
  - Anisotropic filtering
  - Depth compression, fast z/stencil and color clear

Attila Classic
[Figure: Attila Classic block diagram with specialized shaders. Blocks shown: Vertex Fetch, four Vertex Shaders, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, four Fragment Shaders, four ROPs, four Memory Controllers]

Specialized Shader Issues
- Unbalancing
  - In fragment shading limited scenarios (typical) up to 30% of the processing power remains idle (for a GPU with 8 vertex and 4 fragment shaders)
  - In vertex shading limited scenarios up to 70% of the processing power remains idle
- Dedicated Area
  - 4 unused vertex shaders have the same processing power as one fragment shader
  - 4 vertex shaders require 66% of the area of a fragment shader
- Different Designs
  - Increases the complexity of the microarchitecture
  - Increases development and verification time

Attila Unified
[Figure: Attila Unified block diagram with a unified shader pool. Blocks shown: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Scheduler, Distributor, four Shaders (the unified shader pool), four ROPs, four Memory Controllers]

Unified Shader Architecture
Benefits
- Unified programming model
  - DX10/SM4 and OpenGL/GLSlang are already pushing for it
  - The same features for all the program targets
    - Texturing, branching, outputs
- Not just vertex and fragment programs
  - DX10 => geometry shader
  - General Purpose GPU or Stream Processor
- Workload balance
  - Shading resources allocated as required at any point of the rendering

Unified Shader Architecture
Costs
- Scheduler
  - Select which kind of workload must be processed next
  - Partly implemented with multithreading in the fragment shader to hide texture access latency
- Larger instruction memory and constant bank
- Rerouting required
  - All the paths cross the shader pool
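
The scheduling cost and the workload-balance benefit can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Attila scheduler: the function names, the cost of one work item per unit per cycle, and the "serve the longest queue first" policy are all hypothetical, and dependencies between vertex and fragment work are ignored.

```python
def classic_cycles(vertex_work, fragment_work, vertex_units, fragment_units):
    """Cycles needed when each kind of work can only use its dedicated units."""
    v_cycles = -(-vertex_work // vertex_units)      # ceiling division
    f_cycles = -(-fragment_work // fragment_units)  # ceiling division
    # Assumption: both stages run concurrently, so the slower one dominates.
    return max(v_cycles, f_cycles)


def unified_cycles(vertex_work, fragment_work, units):
    """Cycles needed when a scheduler gives each free unit to the longest queue."""
    queues = {"vertex": vertex_work, "fragment": fragment_work}
    cycles = 0
    while any(queues.values()):
        for _ in range(units):
            kind = max(queues, key=queues.get)  # hypothetical policy
            if queues[kind] == 0:
                break  # nothing left to schedule in this cycle
            queues[kind] -= 1
        cycles += 1
    return cycles


# Fragment-limited workload: 4 + 4 dedicated units versus a pool of 8.
classic = classic_cycles(100, 1000, vertex_units=4, fragment_units=4)
unified = unified_cycles(100, 1000, units=8)
```

In this fragment-limited case the dedicated vertex units sit idle for most of the run, while the pool keeps all eight units busy until the work runs out.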

ATTILA Framework
- OpenGL Interceptor tool
- OpenGL library for Attila GPU
- Driver for our Attila GPU
- Attila GPU simulator
- Signal Visualizer Tool

[Figure: ATTILA framework workflow in four stages: Collect, Verify, Simulate, Analyze. Components shown: OpenGL Application, GLInterceptor, Vendor OpenGL Driver, Trace, ATI R520/NVidia G70, Framebuffer, GLPlayer, ATTILA OpenGL Driver, ATTILA Simulator, Signal Visualizer, Statistics, Signal Traffic. Two "CHECK!" markers appear in the diagram.]

GLInterceptor
- Capture a trace of OpenGL API calls from a real game

GLPlayer
- Reproduce the captured trace
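
The capture-and-replay idea behind GLInterceptor and GLPlayer can be sketched in a few lines. This is an illustration only, not the actual tools: `Interceptor`, `replay`, and the `FakeGL` stand-in (with its `clear` and `draw` methods) are hypothetical names, and the JSON trace format is an assumption.

```python
import json


class Interceptor:
    """Wraps a backend object and records every method call made through it."""

    def __init__(self, backend):
        self._backend = backend
        self.trace = []

    def __getattr__(self, name):
        # Only reached for names not defined on Interceptor itself.
        target = getattr(self._backend, name)

        def recorded(*args):
            self.trace.append({"call": name, "args": list(args)})
            return target(*args)

        return recorded


def replay(trace, backend):
    """Re-issue every recorded call, in order, against another backend."""
    for entry in trace:
        getattr(backend, entry["call"])(*entry["args"])


class FakeGL:
    """Hypothetical stand-in for a graphics API; it only logs its calls."""

    def __init__(self):
        self.log = []

    def clear(self, r, g, b):
        self.log.append(("clear", r, g, b))

    def draw(self, count):
        self.log.append(("draw", count))


# Collect: the application talks to the interceptor instead of the driver.
vendor = FakeGL()
gl = Interceptor(vendor)
gl.clear(0, 0, 0)
gl.draw(3)
saved = json.dumps(gl.trace)

# Replay: feed the stored trace to a different backend and compare results.
other = FakeGL()
replay(json.loads(saved), other)
```

Replaying the same trace against two backends and comparing what each one produced mirrors the verification step of the workflow.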

OpenGL Library
- Transforms Fixed Function into Shader code
- 200 API Calls supported
- ARB Vertex and Fragment extensions
- Alpha and Fog emulated via Shader code
Driver
- Low level access
- Attila memory management

ATTILA Simulator
- Detailed cycle-by-cycle simulation of all pipeline stages
- 20 boxes, modeling a 100-deep pipeline
- Execute@Execute: functionality embedded at each pipeline stage
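
A cycle-by-cycle simulator built from boxes can be sketched as follows. This is a minimal illustration of the general technique, not the Attila simulator's code: the `Signal` and `Box` classes, their methods, and the fixed-latency wire model are assumptions made for the example.

```python
from collections import deque


class Signal:
    """Fixed-latency wire: a value written at cycle t is readable at t + latency."""

    def __init__(self, latency):
        self.latency = latency
        self.in_flight = deque()

    def write(self, cycle, value):
        self.in_flight.append((cycle + self.latency, value))

    def read(self, cycle):
        if self.in_flight and self.in_flight[0][0] <= cycle:
            return self.in_flight.popleft()[1]
        return None


class Box:
    """Pipeline stage that applies its own function to whatever arrives."""

    def __init__(self, work, inp, out):
        self.work, self.inp, self.out = work, inp, out

    def clock(self, cycle):
        value = self.inp.read(cycle)
        if value is not None:
            # The stage computes the real result when the data passes through it.
            self.out.write(cycle, self.work(value))


def simulate(boxes, cycles):
    """Clock every box once per simulated cycle."""
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)


a, b, c = Signal(1), Signal(2), Signal(1)
boxes = [Box(lambda x: x + 1, a, b), Box(lambda x: x * 2, b, c)]
a.write(0, 5)       # inject one value at cycle 0
simulate(boxes, 4)  # run cycles 0..3
result = c.read(4)
```

Because every signal has a latency of at least one cycle, the order in which boxes are clocked within a cycle does not change the outcome. Recording what crosses each signal per cycle is the kind of data a signal visualizer could display.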

Find the differences
[Figure: rendered frame comparison, NVIDIA GeForce FX 5900XT vs. Attila]

Benchmark
- Unreal Tournament 2004
  - Fixed function OpenGL API
  - Vertex and fragment shaders generated by our library
- 1024x768 resolution
- 8x Anisotropic Filtering
- 160 of 450 frames simulated
- 40 frames ~ 1 day simulation
  - On a Xeon P4 @ 2.0 GHz

Baseline Configuration
- Four Vertex Shaders (only for Attila-Classic)
- Fragment and Unified shader configuration:
  - 32 threads
  - 4 fragments/vertices per thread
  - 16 128-bit FP registers available for temporary storage per thread
  - n SIMD ALUs
  - 1 scalar ALU (optional)
- 1 Texture Unit per Shader Unit
  - 16 KB texture cache
  - Single cycle bilinear and two cycle trilinear
  - AF up to 16x
- Geometry and Rasterization pipelines limited to 1 vertex and 1 triangle per cycle
- Two ROPs: 8 z and 8 color values written per cycle
- Four 64-bit DDR buses: peak bandwidth 64 bytes/cycle
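
Two of the baseline figures can be checked with simple arithmetic. The sketch below assumes that "DDR" means two transfers per bus per clock cycle, which is what makes the stated 64 bytes/cycle come out; the variable names are illustrative.

```python
# Memory system: four 64-bit DDR buses.
BUSES = 4
BUS_WIDTH_BITS = 64
TRANSFERS_PER_CYCLE = 2  # assumption: DDR moves data on both clock edges

peak_bytes_per_cycle = BUSES * (BUS_WIDTH_BITS // 8) * TRANSFERS_PER_CYCLE

# Shader unit: 32 threads, each carrying 4 fragments or vertices.
THREADS = 32
ELEMENTS_PER_THREAD = 4

elements_in_flight = THREADS * ELEMENTS_PER_THREAD

print(peak_bytes_per_cycle, elements_in_flight)
```

Under these assumptions the memory system peaks at 64 bytes per cycle, and each fragment or unified shader unit keeps up to 128 fragments or vertices in flight.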

"Classic" Performance
[Figure: relative performance (y-axis, 1 to 3) for 1-way, 1-way + scalar, 2-way and 4-way shader configurations; series 2sh, 4sh, 6sh, 8sh; chart annotations: ~75%, ~45%, ~40%, 7%, 8%]
- 8% improvement for 2-way
- Near linear improvement for 4 shaders
- Sublinear improvement for 6 and 8 shaders
  - Limited by memory bandwidth and latency

Frame 330: Detailed Zoom
[Figure: vertex shader and fragment shader workload for 4 vertex shader units and 2 fragment shader units. Utilization (y-axis, 0 to 1) against time in steps of 10K cycles (x-axis, 1 to 901); series: Vertex Shader, Fragment Shader; annotation: vertex shading limited]

Unified Shader Performance
[Figure: relative performance (y-axis, 1 to 3) for 1-way, 1-way + scalar, 2-way and 4-way shader configurations; series 2sh, uni2sh, 4sh, uni4sh, 6sh, uni6sh, 8sh, uni8sh; annotations: fragment shading limited, vertex fetch limited, geometry pipeline limited]
- Unified improvement ranges from 1% (2 shaders) to 8% (eight 1-way shaders)

Area Estimation

                          ATI R400   ATI RV400
Transistors (millions)    160        120
Vertex Shaders            6          4
Fragment Shaders          4          2

Hardware Element          Estimated Area (millions of transistors)
Vertex Shader             2.5
Fragment Shader           15
Additional SIMD ALU       +15%
Additional scalar ALU     +5%

160 - 120 = 40 = 2 vertex shaders * 2.5 + 2 fragment shaders * 15 + 5 (other)
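
The area estimate can be turned into a small model. The per-shader costs come from the table above; the assumption that a unified shader costs the same as a fragment shader is made here only for illustration and is not stated on the slide. Function names are hypothetical.

```python
VERTEX_SHADER_MT = 2.5     # millions of transistors, from the table
FRAGMENT_SHADER_MT = 15.0  # millions of transistors, from the table
OTHER_MT = 5.0             # the "other" term in the R400 vs RV400 difference


def classic_shader_area(vertex_units, fragment_units):
    """Shader area of a design with dedicated vertex and fragment units."""
    return vertex_units * VERTEX_SHADER_MT + fragment_units * FRAGMENT_SHADER_MT


def unified_shader_area(units):
    """Shader area of a unified pool.

    ASSUMPTION: one unified shader costs as much as one fragment shader.
    """
    return units * FRAGMENT_SHADER_MT


def area_saving(vertex_units, shader_units):
    """Fraction of shader area saved by dropping the dedicated vertex units."""
    classic = classic_shader_area(vertex_units, shader_units)
    return 1.0 - unified_shader_area(shader_units) / classic


# The decomposition of the 40M transistor difference: 2 VS + 2 FS + other.
difference = classic_shader_area(2, 2) + OTHER_MT

# Four vertex shaders relative to one fragment shader (about two thirds).
vs_to_fs_ratio = classic_shader_area(4, 0) / FRAGMENT_SHADER_MT

print(difference, round(vs_to_fs_ratio, 2))
print(round(area_saving(4, 2), 3), round(area_saving(4, 8), 3))
```

Under that assumption, replacing four vertex shaders plus N fragment shaders with N unified shaders saves 25% of the shader area for N = 2 and about 7.7% for N = 8.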

Shader Scaling vs Transistors
[Figure: fps (y-axis, 50 to 170) against MTransistors (x-axis, 30 to 180); series 2-way, uni 2-way, 1-way, uni 1-way and a linear reference; points labeled 2sh, 4sh, 6sh, 8sh]
- Linear for 4 shader units, sublinear for more than 4 shader units
- Up to 30% more efficient per area for the unified architecture (two 1-way shaders)

Conclusion
- Attila Unified architecture has better performance than Attila Classic with less hardware
  - Up to 8% better performance
  - 8% to 25% less area required
  - 10% to 30% better performance per area
- Up to 8% better performance for 2-way shader units
- 160% better performance from 2 to 8 fragment or unified shader units
  - Memory bandwidth limited beyond 4 shaders
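
The scaling figure can be restated as an efficiency. The sketch below reads "160% better" as a 2.6x speedup, which is an interpretation rather than something the slide states, and the function name is illustrative.

```python
def scaling_efficiency(improvement_percent, units_before, units_after):
    """Achieved speedup divided by the ideal (linear) speedup."""
    speedup = 1.0 + improvement_percent / 100.0
    ideal = units_after / units_before
    return speedup / ideal


# 2 -> 8 shader units is 4x the hardware; 160% better is read as 2.6x.
efficiency = scaling_efficiency(160, units_before=2, units_after=8)
print(round(efficiency, 2))
```

Under that reading, going from 2 to 8 shader units delivers about 65% of linear scaling, which fits the statement that the design becomes memory bandwidth limited beyond 4 shaders.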

Questions

Performance of Attila Unified vs Classic Attila
[Figure: relative performance (y-axis, 1 to 1.09) for uni2sh, uni4sh, uni6sh and uni8sh; series 1-way, 1-way + scalar, 2-way, 4-way]