ELIS – Multimedia Lab
Charles Hollemeersch, Bart Pieters, 28/11/2008
Guest lecture, GPU team, Multimedia Lab
GPGPU, Bart Pieters - MMLab
Guest lecture OMMT – 28/11/2008
Who we are
• Charles Hollemeersch
  – PhD student at Multimedia Lab
  – [email protected]
• Bart Pieters
  – PhD student at Multimedia Lab
  – [email protected]
• Visit our website
  – http://multimedialab.elis.ugent.be/ and http://multimedialab.elis.ugent.be/GPU
Our Research Topics
• Video acceleration
  – accelerate state-of-the-art video codecs using the GPU
• Game technology
  – texture compression, parallel game actors …
• Medical visualization
  – reconstruction of medical images …
• Multi-GPU applications
  – …
Introducing Multimedia Lab’s ‘Supercomputer’
• Quad-GPU PC
  – four GeForce GTX 280 video cards
  – 3732 gigaflops of GPU processing power
Agenda
• 8:30 – 9:45
  – Bart – GPGPU
• 10:00 – 11:15
  – Charles – Game Technology
Bart Pieters, Multimedia Lab – UGent
28/11/2008
Overview
• Introduction
  – GPU
  – GPGPU
• Programming Concepts and Mappings
  – Direct3D and OpenGL
  – NVIDIA CUDA
• Case Study: Decoding H.264/AVC
  – motion compensation
  – results
• Conclusions
• Q&A
Graphics Processing Unit (GPU)
• Programmable chip on graphics cards
• Developed in a gaming context
  – 3-D scenery by means of rasterization
• Programmable pipeline since DirectX 8
  – vertex and pixel shaders, later also geometry shaders (DirectX 10)
  – high-level language support
• Modern GPUs support high precision
  – 32-bit floating point
• Massive floating-point processing power
  – 933 gigaflops (NVIDIA GeForce GTX 280)
  – 141.7 GB/s peak memory bandwidth
  – fast PCI-Express bus, up to 2 GB/s transfer speed
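A peak gigaflops figure such as the 933 quoted above is a theoretical product, not a measurement. The sketch below shows the arithmetic. The shader clock of 1.296 GHz and the 3 floating-point operations per clock per processor are assumptions that do not appear on the slides; the 240 processors are mentioned later in the deck. The function name peakGflops is hypothetical.

```cpp
#include <cassert>

// Theoretical peak = processors x clock (GHz) x floating-point operations per clock.
// All three inputs are supplied by the caller; nothing is measured here.
double peakGflops(int processors, double clockGHz, int flopsPerClock) {
    assert(processors > 0 && clockGHz > 0.0 && flopsPerClock > 0);
    return processors * clockGHz * flopsPerClock;
}
```

Under these assumptions, peakGflops(240, 1.296, 3) gives about 933, and four such cards give about 3732. Real applications reach only a fraction of this peak.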
CPU and GPU Comparison
                          Intel Xeon X5355       NVIDIA G80 (8800 GTX)
Clock speed               2.66 GHz               575 MHz
Cores / SPEs              4                      128
Max. GFLOP/s (float)      85                     500
Typical instr. duration   1-2 cycles (SSE)       min. 4 cycles
Die size (mm²)            143                    480
Typical memory speed      8 GB/s (DDR2-1066)     86 GB/s (GDDR3-1800)
Power usage (W)           120                    185
Price (€)                 800                    500
• Today’s GPUs are yesterday’s supercomputers
Why are GPUs so fast?
• Parallelism
  – massively parallel, many-core architecture
    • needs a lot of parallel work to run efficiently
  – specialized hardware built for parallel tasks
  – more transistors mean more performance
• Multi-billion dollar gaming industry drives innovation
[Figure: CPU and GPU chip layouts side by side, showing control logic, cache, ALUs, and DRAM]
Computational Model: Stream Processing Model
• GPU is practically a stream processor
• Applications consist of streams and kernels
• Each kernel takes relatively long to process (PCIe, memory latency)
  – latency hidden by throughput
[Figure: Input Stream → Kernel → Output Stream]
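A minimal CPU-side sketch of the stream model: the kernel is a function of a single input element, and the output stream is produced by applying it to every element independently. The kernel body (scale and offset) and the names kernel and runKernel are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Kernel: depends only on its own input element, so every element of the
// stream could be processed in parallel. (Hypothetical example computation.)
float kernel(float in) { return 2.0f * in + 1.0f; }

// Apply the kernel to each element of the input stream to build the output stream.
std::vector<float> runKernel(const std::vector<float>& input) {
    std::vector<float> output(input.size());
    std::transform(input.begin(), input.end(), output.begin(), kernel);
    return output;
}
```

Because no element depends on another, the order in which elements are processed does not matter, which is what lets a GPU spread them over many processors.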
Inside a modern GPU
[Figure: block diagram of a modern GPU. Host interface and data assembler; vertex, geometry, and pixel thread issue; setup / raster / ZCull; a thread processor driving eight groups of streaming processors (SP), each with texture filtering (TF) and an L1 cache; six L2 caches with frame buffer (FB) partitions]
Introducing GPGPU
• The GPU on commodity video cards has evolved into a processor that is
  – powerful
  – flexible
  – inexpensive
  – precise
• Attractive platform for general-purpose computation
GPGPU
• General-Purpose GPU
  – use the GPU for general-purpose algorithms
• No magical GPU compiler
  – still no x86 processor (Larrabee?)
  – explicit mappings required using advanced APIs
• Programming close to the hardware
  – trend towards higher abstraction, e.g. NVIDIA CUDA
• Techniques are suited for future many-core architectures
  – future CPU/GPU projects, AMD Fusion, Larrabee, …
• Dependency issues
  – hundreds of independent tasks required for efficient use
Stream Processing Model Revisited
• GPU is practically a stream processor
• Applications consist of streams and kernels
• Read back is not possible: a kernel cannot read the output stream it is writing
[Figure: Input Stream → Kernel → Output Stream]
GPGPU in Practice
[Figure: bar chart of speedups ("… times faster", axis 0 to 80) for fluid dynamics, protein folding, neural networks, financial calculations, matrix multiplication, ray tracing, medical imaging, and video coding]
GPGPU APIs
• Classic way
  – (mis)use the graphics pipeline
    • render a special ‘scene’
  – Direct3D, OpenGL
  – pixel, geometry, and vertex shaders
• New APIs specifically for GPGPU computations
  – NVIDIA CUDA, ATI CTM, DirectX 11 Compute Shader, OpenCL
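3-D Pipeline
• Deep pipeline
[Figure: the 3-D pipeline. CPU: application. GPU: transform (programmable) → rasterizer → shade (programmable) → video memory (textures). Data between the stages: vertices (3-D) → transformed, lit vertices (2-D) → fragments (pre-pixels) → final pixels (color, depth). The graphics state configures the stages; render-to-texture feeds results back as textures.]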
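3-D Pipeline - Transform
• Vertex Shader
  – processing geometry data
  – input is a vertex
    • position
    • texture coordinates
    • vertex color, ...
  – output is a vertex
Vertex shader example:
    struct Vertex
    {
        float3 position : POSITION;
        float4 color    : COLOR0;
    };

    // The slide reads this value as IN.wave without declaring it; it is
    // assumed here to be a shader constant set by the application.
    float2 waveOffset;

    Vertex wave(Vertex vin)
    {
        Vertex vout;
        vout.position.x = vin.position.x;
        vout.position.y = vin.position.y;
        // displace the vertex height with a sine wave
        vout.position.z = (sin(vin.position.x) + sin(waveOffset.x)) * 2.5f;
        vout.color = float4(1.0f, 1.0f, 1.0f, 1.0f);
        return vout;
    }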
3-D Pipeline - Shading
• Pixel (or fragment) Shader
  – input is interpolated vertex data
    • position
    • texture coordinates
    • normals, …
  – use texels from a texture
  – output is a fragment
    • pixel color
    • transparency
    • depth
  – result is stored in the frame buffer or in a texture
    • ‘Render to Texture’
Pixel shader example:
    struct PSIn  { float2 tex   : TEXCOORD0; };
    struct PSOut { float4 color : COLOR0; };

    sampler2D texSampler;   // texture bound by the application

    PSOut shade(PSIn pin)
    {
        PSOut pout;
        pout.color = tex2D(texSampler, pin.tex);   // fetch one texel
        return pout;
    }
GPU-CPU Analogies
• Explicit mapping on 3-D concepts is necessary
• Rewrite an algorithm and find parallelism
• Use the GPU in parallel to the CPU
  1. upload data to the GPU
     • very fast PCI-Express bus, up to 2 GB/s transfer speed
  2. process the data
     • meanwhile the CPU is available
  3. download result to the CPU
     • recent GPU models have high download speed
GPU-CPU Pipelined Design
[Figure: timeline of CPU and GPU. The CPU alternates between other work and preparing GPU data; the GPU data passes through an intermediary buffer in system memory; the GPU processes and visualizes one batch while the CPU prepares the next.]
GPU-CPU Analogies (2)
• CPU array = GPU texture
[Figure: an array on the CPU next to a texture on the GPU, addressed with (u, v) coordinates]
GPU-CPU Analogies (3)
• Array write = render to texture
CPU (pseudocode):
    …
    fish[] = createfish()
    …
    for all pixels
        bwfish[i][j] = bw(fish[i][j]);
    …
GPU: render
GPU-CPU Analogies (4)
• Loop body / kernel / algorithm step = fragment program
• Example: motion compensation
CPU (C++):
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            // loop body: this is the part that becomes the fragment program
            Vec2 mv = mvectors[y / 4][x / 4];         // one motion vector per 4x4 block
            int ox = Clip(x + mv.x, 0, width - 1);    // Clip(v, lo, hi) is assumed to clamp v to [lo, hi]
            int oy = Clip(y + mv.y, 0, height - 1);
            output[y][x] = input[oy][ox];
        }
    }
GPU (Microsoft HLSL):
    PSOut motioncompens(PSIn pin)
    {
        PSOut pout;
        float2 mv = pin.mv;
        float2 texcoords = pin.texcoords;
        texcoords += mv;
        pout.color = tex2D(texSampler, texcoords);
        return pout;
    }
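The loop-body analogy can be sketched as a self-contained C++ program in which the per-pixel kernel is a separate function and the two loops stand in for the rasterizer. The flat array layout, the Vec2 struct, and the names mcPixel and motionCompensate are assumptions made for this illustration; the slides do not define them.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Vec2 { int x; int y; };

// Kernel: computes exactly one output pixel. It may read anywhere in the
// reference picture but produces only the value for (x, y).
unsigned char mcPixel(const std::vector<unsigned char>& ref, int width, int height,
                      const std::vector<Vec2>& mvs, int x, int y) {
    int blocksPerRow = (width + 3) / 4;                  // one motion vector per 4x4 block
    Vec2 mv = mvs[(y / 4) * blocksPerRow + (x / 4)];
    int ox = std::max(0, std::min(x + mv.x, width - 1));   // clamp to the picture borders
    int oy = std::max(0, std::min(y + mv.y, height - 1));
    return ref[oy * width + ox];
}

// CPU driver: on the GPU these loops disappear, because the rasterizer
// invokes the kernel once per output pixel.
std::vector<unsigned char> motionCompensate(const std::vector<unsigned char>& ref,
                                            int width, int height,
                                            const std::vector<Vec2>& mvs) {
    assert(width > 0 && height > 0);
    std::vector<unsigned char> out(ref.size());
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            out[y * width + x] = mcPixel(ref, width, height, mvs, x, y);
    return out;
}
```

Since mcPixel writes nothing except its return value, the iterations are independent, which is the property the fragment program relies on.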
GPU Loop for Each Pixel
• Render a quad: vertex shader → rasterizer → pixel shader
• The rasterizer generates one fragment for every pixel the quad covers
• The same pixel shader runs once per fragment, many instances in parallel
    PSOut motioncompens(PSIn pin)
    {
        PSOut pout;
        float2 mv = pin.mv;
        float2 texcoords = pin.texcoords;
        texcoords += mv;
        pout.color = tex2D(texSampler, texcoords);
        return pout;
    }
GPGPU-specific APIs
• NVIDIA CUDA
  – Compute Unified Device Architecture
  – C code with annotations, compiled to executable code
• DirectX 11 Compute Shader
  – shader execution without rendering
  – technology preview available in latest DirectX SDK
• OpenCL
  – Open Computing Language
  – C-based kernel language with annotations
• ATI CTM
  – Close to The Metal
  – GPU assembler
  – deprecated
NVIDIA CUDA
• General-purpose GPU computing platform
• GPU is a super-threaded co-processor
  – runs massive numbers of GPU threads
• Supported on NVIDIA G80 and higher
  – 50-500 EUR price range
• No more (mis)use of the 3-D API
• C code with annotations for
  – memory location
  – host or device functions
  – thread synchronization
• Compilation with the CUDA compiler
  – splits host and device code
  – linkable object code
NVIDIA CUDA - Example
    void runGPUTest()
    {
        CUT_DEVICE_INIT();
        ...
        float* d_data = NULL;

        // allocate gpu memory
        cudaMalloc( (void**) &d_data, size);

        dim3 dimBlock(8, 8, 1);
        dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

        // run kernel on gpu
        transformKernel<<< dimGrid, dimBlock, 0 >>>( d_data );

        // download
        cudaMemcpy( h_data, d_data, size, cudaMemcpyDeviceToHost);
        ...
    }
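NVIDIA CUDA – Example (2)
    // mvtex, reftex, and width are declared elsewhere (not shown on the slide)
    __global__ void transformKernel( float* g_odata)
    {
        // calculate the pixel coordinates handled by this thread
        unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
        unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

        // fetch the motion vector for this pixel
        int2 mv = tex2D(mvtex, x, y);

        // fetch the displaced pixel from the reference picture
        int mx = x + mv.x;
        int my = y + mv.y;
        g_odata[y*width + x] = tex2D(reftex, mx, my);
    }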
Programming Model
[Figure: the host (CPU) launches kernel 1 and kernel 2 on the device (GPU). Each kernel runs as a grid (grid 1, grid 2) of blocks, Block(0, 0) to Block(2, 1). Each block, e.g. Block (1, 1), consists of threads, Thread(0, 0) to Thread(2, 3).]
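The grid/block/thread hierarchy boils down to simple index arithmetic, sketched here in plain C++ so it can be run without a GPU. Dim2, blocksNeeded, and globalIndex are hypothetical names; the parameter names mirror CUDA's built-in blockIdx, blockDim, and threadIdx variables.

```cpp
#include <cassert>

struct Dim2 { unsigned x; unsigned y; };

// Number of blocks needed to cover `size` elements with blocks of `blockSize`
// threads, rounded up so that a partially filled last block is included.
unsigned blocksNeeded(unsigned size, unsigned blockSize) {
    assert(blockSize > 0);
    return (size + blockSize - 1) / blockSize;
}

// Global coordinate of one thread, computed the way a kernel derives it
// from its block index, the block dimensions, and its thread index.
Dim2 globalIndex(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx) {
    Dim2 g = { blockIdx.x * blockDim.x + threadIdx.x,
               blockIdx.y * blockDim.y + threadIdx.y };
    return g;
}
```

The CUDA example earlier in the deck divides width and height by the block size directly, which truncates when they are not multiples of 8. Rounding up as in blocksNeeded covers the whole picture, but the kernel then needs a bounds check for threads that fall outside it.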
Hardware Model
• Multiprocessor – MP (16)
• Streaming Processor (8 per MP)
  – handles one thread
• Memory
  – very fast, but high latency
  – uncached
  – special memory hardware for constants & texture (cached)
• Registers
  – limited amount
CUDA Threads
• Each Streaming Processor handles one thread
  – 240 on the GeForce GTX 280!
• Smart hardware can schedule thousands of threads on 240 processors
• Extremely lightweight
  – not like CPU threads
• Threads per Multiprocessor are handled in SIMD manner
  – each thread executes the same instruction at a given clock cycle
  – lock-step execution
[Figure: Block (1, 1) with its threads, Thread(0, 0) to Thread(2, 3)]
Lock-step Execution
Threads 1 to 32 all step through the same code:
    x = x * 2;
    if (x > 10)
        z = 0;
    else
        z = y / 2;
    ++x;
Initial values: Thread 1: x = 3; Thread 2: x = 100; …; Thread 31: x = 200; Thread 32: x = 1
[Figure: the four threads side by side with the program counter stepping through the code; threads that do not take the branch currently being executed are marked Locked]
• Heavy branching needs to be avoided
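Why branching is costly under lock-step execution can be illustrated with a small model. It assumes, as a simplification, that the group executes every branch path that at least one of its threads takes, one path after the other, while the remaining threads wait. The function name branchPasses is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Counts how many branch bodies a lock-step group has to step through for
//   x = x * 2; if (x > 10) z = 0; else z = y / 2; ++x;
// given the initial x of each thread. A path is executed once if any thread
// takes it; threads on the other path are idle in the meantime.
int branchPasses(const std::vector<int>& xs) {
    bool anyThen = false;
    bool anyElse = false;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        if (xs[i] * 2 > 10) anyThen = true;
        else                anyElse = true;
    }
    return (anyThen ? 1 : 0) + (anyElse ? 1 : 0);
}
```

With the slide's values (3, 100, 200, 1) both paths are taken, so the group pays for both bodies; if all threads agree, only one body is executed.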
Decoding H.264/AVC
• Many decoding steps are suitable for parallelization
  – quantization
  – transformation
  – motion compensation
  – deblocking
  – color space conversion
• Others introduce dependencies
  – entropy coding
  – intra prediction
Video Coding Hardware
• Specialized on-board 2-D video processing chips
  – one macroblock at a time
  – black boxes
    • limited support for non-Windows systems
  – limited support for various video codecs
    • e.g. H.264/AVC profiles
  – partly programmable
• GPU
  – millions of transistors
  – accessible via 3-D API or general-purpose GPU API
Decoding an H.264/AVC bitstream
• H.264/AVC
  – recent video coding standard
  – successor of MPEG-4 Visual
• Computationally intensive
  – multiple reference frames (up to 16)
  – B-pictures
  – sub-pixel interpolations
• Motion compensation, reconstruction, deblocking, and color space conversion
  – take up to 80% of total processing time
  – suitable for execution on the GPU
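What the 80% figure implies for the whole decoder can be estimated with Amdahl's law, which is not part of the slides and is brought in here only as a rough guide. The sketch assumes the offloaded steps and the remaining steps run one after the other; a pipelined design that overlaps CPU and GPU work is not captured by this formula. The function name overallSpeedup is hypothetical.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction `offloaded` of the run time
// is accelerated by a factor `kernelSpeedup` and the rest is unchanged.
double overallSpeedup(double offloaded, double kernelSpeedup) {
    assert(offloaded >= 0.0 && offloaded <= 1.0 && kernelSpeedup > 0.0);
    return 1.0 / ((1.0 - offloaded) + offloaded / kernelSpeedup);
}
```

With 80% offloaded, a 2x faster GPU stage gives roughly 1.67x overall, and no GPU speedup can push the sequential total beyond 5x, because the remaining 20% still runs on the CPU.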
Pipelined Design for Video Decoding
[Figure: timeline of CPU and GPU. CPU: read bitstream, then VLD, IQ, and inverse transformation. Motion vectors (MVs), residue, and QPs pass through an intermediary buffer in system memory. GPU: MC, reconstruction, and deblocking, then CSC and visualization. Both stages repeat for successive pictures.]
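Motion Compensation
[Figure: input sequence over time with a reference picture and the current picture. Motion vectors (x1,y1), (x2,y2), (x3,y3), … applied to the reference picture give the prediction by motion compensation. Residual data = current picture - prediction.]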
Motion Compensation: Decoder
[Figure: over time, motion vectors (x1,y1), (x2,y2), (x3,y3), … applied to the reference picture give the prediction by motion compensation; the residual data is added (+) to the prediction.]
Motion Compensation in CUDA
[Figure: kernel 1 on the device (GPU) runs as a grid of blocks, Block(0, 0) to Block(2, 1); the threads of each block, e.g. Thread(0, 0) to Thread(2, 2) of Block (1, 1), map onto the output array.]
Motion Compensation in Direct3D
• Put video picture in textures
• Use vertices to represent a macroblock
• Let texture coordinates point to the texture
• Full-pel motion compensation
  – manipulate texture coordinates
• Multiple pixel shaders fill macroblocks and interpolate
[Figure: two macroblock quads with texture coordinates at their corners, [0.50,0.30], [0.60,0.30], [0.50,0.40], [0.60,0.40] and [0.50,0.40], [0.60,0.40], [0.50,0.50], [0.60,0.50], pointing into the reference texture for the rasterization process]
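Manipulating texture coordinates for full-pel motion compensation amounts to shifting a macroblock corner by its motion vector and normalizing by the picture size. The sketch below shows that arithmetic only; TexCoord and texCoord are hypothetical names, and API-specific texel-center offsets are ignored.

```cpp
#include <cassert>

struct TexCoord { float u; float v; };

// Normalized texture coordinate of the pixel corner (px, py) in a reference
// picture of width x height pixels, displaced by a full-pel motion vector.
TexCoord texCoord(int px, int py, int mvx, int mvy, int width, int height) {
    assert(width > 0 && height > 0);
    TexCoord t = { static_cast<float>(px + mvx) / width,
                   static_cast<float>(py + mvy) / height };
    return t;
}
```

Applying this to the four corners of a macroblock quad makes the rasterizer interpolate the displaced coordinates for every pixel in between, so the pixel shader only has to sample the reference texture.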
Interpolation Strategies for Sub-pixel MC
[Figure: a vertex grid covering the viewable area is processed by the vertex shaders; pixel shaders 1, 2, and 3 handle full-pel, half-pel, and quarter-pel (Q-pel) positions; the results are combined (+).]
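The arithmetic behind half-pel and quarter-pel luma samples can be sketched as follows. This is the one-dimensional case of the H.264/AVC 6-tap filter with taps (1, -5, 20, 20, -5, 1) and a quarter-pel average; it is not the deck's shader code, and positions that need both horizontal and vertical filtering are left out. The names halfPel and quarterPel are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Half-pel sample between c and d, computed from six neighbouring full-pel
// samples a b c d e f with the 6-tap filter (1, -5, 20, 20, -5, 1) / 32.
int halfPel(int a, int b, int c, int d, int e, int f) {
    int sum = a - 5 * b + 20 * c + 20 * d - 5 * e + f;
    // Round, scale, and clip to the 8-bit sample range. Because of the clip,
    // dividing by 32 gives the same result as a right shift by 5.
    return std::min(255, std::max(0, (sum + 16) / 32));
}

// Quarter-pel sample: rounded average of the two nearest full- or half-pel samples.
int quarterPel(int p, int q) { return (p + q + 1) >> 1; }
```

Each output sample depends only on a handful of neighbouring input samples, so these filters fit the per-pixel kernel model used throughout the deck.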
Experimental Results
• GPU algorithms run faster than their CPU counterparts
• CPU is offloaded, free for other tasks
[Figure: bar chart of speedup ("… times faster", axis 0 to 3) for motion compensation, deblocking, and color space conversion, at 720p and 1080p]
Conclusions
• GPU is an attractive platform for general-purpose computation
  – flexible, powerful, inexpensive
• General-purpose APIs
  – approach the GPU as a super-threaded co-processor
• GPGPU requires lots of parallel jobs
  – e.g. hundreds to thousands
• GPGPU allows faster execution while offloading the CPU
  – e.g. decoding of H.264/AVC bitstreams
• GPGPU techniques are suited for future architectures
Questions?
• Multimedia Lab
  – http://multimedialab.elis.ugent.be/GPU
  – [email protected]