Bartłomiej Filipek, www.bfilipek.com
How does this work?
General architecture
Advice
Tools
This lecture does not cover the technical details of the GPU; it gives only the overview needed to understand current technologies and standards.
[Diagram: CPU, bus (commands, textures, vertices, shaders, data…), GPU]
[Diagram: application, 3D API (DirectX/OpenGL), driver, vertex processing, fragment processing, framebuffer, memory, display]
[Diagram: vertex processing on vertex units, fragment processing on pixel units, memory/textures]
As we can see, previous architectures matched the "fixed" vertex/fragment chain: all the data was first processed in the vertex units and then moved to the fragment units.
SISD – Single Instruction, Single Data
The standard way… one instruction is executed on a single piece of data.
SIMD – Single Instruction, Multiple Data
One instruction is executed on several pieces of data at once, like one 4D vector (128 bits).
MIMD – Multiple Instructions, Multiple Data
Parallel processing!
Fixed task division…
[Figure: units split into "vertex units used", "fragment units used" and "unused", for an effect that uses a lot of vertex processing and for an effect that uses a lot of fragment processing]
Dynamic task division…
[Figure: all units split between "vertex units used" and "fragment units used", for the same two effects]
The numbers of vertex units and fragment units used to be fixed: we had N vertex processors and M fragment processors. Now we have a unified architecture: K units that can each process both vertices and fragments… there is no difference between them.
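A toy model can make the difference between N + M fixed units and K unified units concrete. This is only a sketch under strong assumptions: work is measured in abstract units, it divides evenly, and the numbers (800, 200, 4, 12, 16) are made up for illustration, not taken from any real GPU.

```python
def fixed_split_time(vertex_work, fragment_work, n_vertex_units, m_fragment_units):
    """Time when vertex and fragment work can only run on their own kind of unit."""
    # Both stages run side by side, so the more heavily loaded stage sets the time.
    return max(vertex_work / n_vertex_units, fragment_work / m_fragment_units)


def unified_time(vertex_work, fragment_work, k_units):
    """Time when any of the K units can take any kind of work."""
    return (vertex_work + fragment_work) / k_units


# Hypothetical vertex-heavy effect on 16 units in total (4 + 12 fixed, or 16 unified).
vertex_heavy_fixed = fixed_split_time(800, 200, n_vertex_units=4, m_fragment_units=12)
vertex_heavy_unified = unified_time(800, 200, k_units=16)
print(vertex_heavy_fixed, vertex_heavy_unified)
```

In this model the fixed split is limited by its 4 vertex units while most fragment units wait; the unified pool spreads the same work over all 16.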
[Diagram: controller, stream processors, shared memory]
As we can see, there are no vertex/fragment units… instead there are stream processors that can handle both vertices and fragments… and even more.
Scalars… not vectors!
A stream processor (SP) operates on a single scalar value per instruction.
But we have a lot of SPs!
SPs give far greater flexibility.
GPGPU
SIMT – Single Instruction Multiple Threads
New architecture - NV DX11, OpenCL Multithreaded Rendering
Rendering commands can be called from different threads
3 000 000 000 transistors! End of 2009? End of winter 2010? Never?
Double precision calculations cost twice as much as float, not ten times as much as before!
Debugging – one can debug the GPU directly from Visual Studio
Unified Shader
Geometry Shader
Vertex Shader
Fragment Shader
CUDA OpenCL DirectX Compute ATI Stream
General-purpose computing on graphics processing units
Kernels – code that will be executed on the GPU
Not only graphics but also:
Physics ▪ Fluids ▪ Collisions ▪ N-body simulations…
Financial
Speech/pattern recognition
Phenomena modelling – weather…
Neural nets
AI
Use as few as possible:
• Calculations
• Huge textures – use mipmaps instead
• Interpolators
• Data
• Rendering state changes
• Dynamic vertex buffers
• Textures… maybe use texture atlases
• Texture fetches
Use more:
• Batches
• Triangle strips
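Why triangle strips help can be shown with simple counting: a list sends 3 vertices per triangle, while a strip reuses the previous two vertices and sends only N + 2 for N triangles. The sketch below uses hypothetical helper names and expands a strip back into triangles with the usual alternating winding.

```python
def vertices_for_triangle_list(triangle_count):
    """A plain triangle list sends three vertices for every triangle."""
    return 3 * triangle_count


def vertices_for_triangle_strip(triangle_count):
    """A strip sends two starting vertices plus one new vertex per triangle."""
    return triangle_count + 2 if triangle_count > 0 else 0


def strip_to_triangles(strip):
    """Expand strip indices into triangles, keeping a consistent winding order."""
    triangles = []
    for i in range(len(strip) - 2):
        a, b, c = strip[i], strip[i + 1], strip[i + 2]
        # Every second triangle swaps its first two vertices to keep the winding.
        triangles.append((a, b, c) if i % 2 == 0 else (b, a, c))
    return triangles
```

For 100 triangles this is 300 vertices as a list against 102 as a strip.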
Use Maths
Reduce calculation on uniform vars!
Normalize
Uniform sphere:
p = sqrt(Rx^2 + Ry^2 + (Rz + 1)^2) = sqrt(Rx^2 + Ry^2 + Rz^2 + 2Rz + 1)
R vector is normalized so: Rx^2 + Ry^2 + Rz^2 = 1
p = sqrt(2 * (Rz + 1)) = 1.414*sqrt(Rz + 1)
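The simplification can be checked numerically on the CPU. A minimal sketch, assuming R is given as a 3-tuple and normalized first; the function names are made up for this example.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def p_full(r):
    """Original form: sqrt(Rx^2 + Ry^2 + (Rz + 1)^2)."""
    rx, ry, rz = r
    return math.sqrt(rx * rx + ry * ry + (rz + 1.0) ** 2)


def p_short(r):
    """Simplified form, valid only for a normalized R: sqrt(2) * sqrt(Rz + 1)."""
    return math.sqrt(2.0) * math.sqrt(r[2] + 1.0)


r = normalize((0.3, -0.5, 0.8))
difference = abs(p_full(r) - p_short(r))
```

The two forms agree up to floating-point rounding, and the short one needs only Rz.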
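half4 main(float2 diffuse : TEXCOORD0,
           uniform sampler2D diffuseTex,
           uniform half4 g_OverbrightColor) {
    // The constant factor 3.0 is applied for every fragment; it could be folded into g_OverbrightColor on the CPU.
    return tex2D(diffuseTex, diffuse) * g_OverbrightColor * 3.0;
}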
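The multiplication by 3.0 in the shader above involves only a uniform and a constant, so it can be done once on the CPU. The sketch below models that in Python; `premultiply_uniform`, `shade_original` and `shade_premultiplied` are hypothetical names for illustration, not part of any graphics API.

```python
def premultiply_uniform(overbright_color, factor=3.0):
    """Fold the constant factor into the uniform once, before uploading it."""
    return tuple(c * factor for c in overbright_color)


def shade_original(texel, overbright_color):
    """Per-fragment work of the original shader: texel * uniform * 3.0."""
    return tuple(t * c * 3.0 for t, c in zip(texel, overbright_color))


def shade_premultiplied(texel, premultiplied_color):
    """Per-fragment work once the uniform already includes the factor."""
    return tuple(t * c for t, c in zip(texel, premultiplied_color))


color = (0.2, 0.4, 0.6, 1.0)
texel = (0.5, 0.25, 0.75, 1.0)
before = shade_original(texel, color)
after = shade_premultiplied(texel, premultiply_uniform(color))
```

Both variants give the same color up to rounding, but the second one saves a multiply per fragment.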
dot(normalize(N), normalize(L)) uses two sqrts!
but:
(N/|N|) dot (L/|L|) = (N dot L) / (|N| * |L|) = (N dot L) / sqrt((N dot N) * (L dot L)) = (N dot L) * rsq((N dot N) * (L dot L))
Now we have only one sqrt – three dots are much cheaper than sqrt
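The identity can be verified on the CPU. A minimal sketch with made-up function names; `math.sqrt` stands in for the shader's rsq instruction.

```python
import math


def dot(a, b):
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))


def cos_with_two_normalizes(n, l):
    """dot(normalize(N), normalize(L)): two square roots."""
    n_len = math.sqrt(dot(n, n))
    l_len = math.sqrt(dot(l, l))
    return dot([x / n_len for x in n], [x / l_len for x in l])


def cos_with_one_rsqrt(n, l):
    """(N dot L) * rsq((N dot N) * (L dot L)): three dots, one square root."""
    return dot(n, l) / math.sqrt(dot(n, n) * dot(l, l))


n = (1.0, 2.0, 3.0)
l = (-2.0, 0.5, 4.0)
difference = abs(cos_with_two_normalizes(n, l) - cos_with_one_rsqrt(n, l))
```

Neither vector has to be unit length for the second form to work.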
Calculate this before it is sent to the GPU!
Texture lookups:
~ 10 : 1 (ALU:Sampler)
Normalization cube map
A single "dot" is not worth a texture lookup…
But calculating a normal distribution… YES!
Early Z-Test
Depth-only rendering first, then the full scene (rendered for the second time)
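The gain from a depth-only pass can be estimated with a toy model of a single pixel. The sketch assumes opaque geometry, a "less than" depth test in the normal pass and an "equal" test in the second pass after the depth-only pass; the depth values are made up.

```python
def shaded_without_prepass(depths_in_draw_order):
    """Count fragments that pass an early 'less than' depth test and get shaded."""
    nearest = float("inf")
    shaded = 0
    for depth in depths_in_draw_order:
        if depth < nearest:
            nearest = depth
            shaded += 1
    return shaded


def shaded_with_prepass(depths_in_draw_order):
    """Depth-only pass stores the nearest depth; the second pass shades only matches."""
    if not depths_in_draw_order:
        return 0
    nearest = min(depths_in_draw_order)  # result of the depth-only pass
    return sum(1 for depth in depths_in_draw_order if depth == nearest)


# Worst case for a single pass: four layers drawn back to front.
back_to_front = [0.9, 0.7, 0.5, 0.3]
```

Drawn back to front, all four layers are shaded in a single pass, but only the visible one after a depth-only pass.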
Reduce the number of attributes – "pack" them where possible. float4 myData is better than:
▪ float3 myDataOne; ▪ float1 myDataTwo;
But do not pack in interpolators
Use as few scalars as possible
When vectors are packed, no optimizations can be performed
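On the CPU side, packing means writing the float3 and the float1 into one four-float attribute. A small sketch, assuming 32-bit little-endian floats; the function names are hypothetical and the variable names only mirror myDataOne and myDataTwo from the slide.

```python
import struct


def pack_attributes(my_data_one, my_data_two):
    """Pack a float3 and a float1 into the 16 bytes of a single float4 attribute."""
    x, y, z = my_data_one
    return struct.pack("<4f", x, y, z, my_data_two)


def unpack_attributes(blob):
    """Split the float4 back into its xyz part and its w part."""
    x, y, z, w = struct.unpack("<4f", blob)
    return (x, y, z), w


packed = pack_attributes((1.0, 0.5, -2.0), 0.25)
```

In the shader the two values are then read as myData.xyz and myData.w, using one attribute slot instead of two.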
What do you really need?
Normal, binormal, tangent… no! You need only two of them! Binormal = cross(Normal, Tangent)
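Rebuilding the binormal from the other two vectors is a single cross product. A minimal sketch, assuming normal and tangent are perpendicular unit vectors; depending on the handedness convention of the tangent space, the sign of the result may need to be flipped.

```python
def cross(a, b):
    """Cross product of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def binormal_from(normal, tangent):
    """Binormal = Normal cross Tangent, so only two vectors need to be stored and sent."""
    return cross(normal, tangent)


normal = (0.0, 0.0, 1.0)
tangent = (1.0, 0.0, 0.0)
binormal = binormal_from(normal, tangent)
```

This trades one vertex attribute for a few ALU instructions.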
PerfKit • Mostly for DirectX • Little support for OpenGL – via glExpert
PIX for Windows • Shows everything! But only for Windows, DirectX…
Similar to PIX, but for OpenGL… $800 ;(
AMD GPU Perf
GLIntercept • OpenGL • free • log every call of opengl command • edit shaders in realtime • although it is a bit simple it has a powerful impact on debugging…
GPU ShaderAnalyzer • free, from AMD! • glsl/hlsl • shows number of asm instructions • ALU, TEX instructions, etc.. • bottlenecks
FXComposer, by NVidia
RenderMonkey by AMD/ATI
ShaderDesigner by TyphoonLabs
PPAM – slides - PARALLEL PROCESSING AND APPLIED MATHEMATICS, Wrocław 2009
developer.nvidia.com
glintercept.nutty.org
developer.amd.com
Nvidia GeForce GTX 260/280 Review