General Purpose Graphics Processing Units (GPGPUs)
Lecture notes from MKP, J. Wang, and S. Yalamanchili
(2)
Overview & Reading
• Understand the multi-threaded execution model of modern general purpose graphics processing units (GPUs)
• Basic architectural organization so we can understand sources of performance and energy efficiency
• Reading: Section 6.6
(3)
What is a GPGPU?
• Graphics Processing Unit (GPU): NVIDIA/AMD/Intel
  ◦ Many-core architecture
  ◦ Massively data-parallel processor (compared with a CPU)
  ◦ Highly multi-threaded
• GPGPU: General-Purpose GPU, used for high performance computing
  ◦ Became popular with the CUDA and OpenCL programming languages
(4)
Motivation
• High Throughput and Memory Bandwidth
(5)
Discrete GPUs in the System
(6)
Fused GPUs: AMD & Intel
On-chip with the CPU, sharing the cache
Not as powerful as discrete GPUs
(7)
Core Count: NVIDIA
1536 cores at 1GHz
• Not all cores are created equal
• Need to understand the programming model
(8)
GPU Architectures (NVIDIA Tesla)

[Figure: a streaming multiprocessor (SM) containing 8 × streaming processors]
(9)
NVIDIA GK110 Architecture
(10)
CUDA Programming Model
• NVIDIA
• Compute Unified Device Architecture (CUDA)
• Kernel: C-like function executed on GPU
• SIMD or SIMT
  ◦ Single Instruction Multiple Data/Thread (SIMD, SIMT)
  ◦ All threads execute the same instruction
  ◦ But each on its own data
  ◦ Lock step

[Figure: threads 0–7 executing Inst 0 and then Inst 1 in lock step, each on its own data]
(11)
CUDA Thread Hierarchy

• Each thread uses its IDs to decide what data to work on
  ◦ IDs are up to 3-dimensional
  ◦ Hierarchy: Thread, Block, Grid
[Figure: one Grid per kernel launch (Kernel 0, Kernel 1, Kernel 2); each grid contains Blocks indexed (0,0,0), (0,0,1), (0,1,0), (0,1,1); each block contains Threads with 1-D (0–3), 2-D ((0,0)–(3,3)), or 3-D ((0,0,0)–(1,0,3)) indices]
(12)
Vector Addition
• Let’s assume N = 16, blockDim = 4 → 4 blocks

blockIdx.x:    0           1           2            3
blockDim.x:    4           4           4            4
threadIdx.x:   0,1,2,3     0,1,2,3     0,1,2,3      0,1,2,3
Idx:           0,1,2,3     4,5,6,7     8,9,10,11    12,13,14,15

[Figure: each block's threads add the corresponding elements of a and b]

for (int index = 0; index < N; ++index) {
    c[index] = a[index] + b[index];
}
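The Idx mapping in the table can be checked with a small host-side sketch (plain C++, not CUDA; `global_indices` is a helper name introduced here, with blockIdx/threadIdx played by loop variables):

```cpp
#include <vector>

// Enumerates the slide's index mapping on the host: one entry per thread,
// computed as blockIdx.x * blockDim.x + threadIdx.x.
std::vector<int> global_indices(int numBlocks, int blockDim) {
    std::vector<int> idx;
    for (int b = 0; b < numBlocks; ++b)          // plays the role of blockIdx.x
        for (int t = 0; t < blockDim; ++t)       // plays the role of threadIdx.x
            idx.push_back(b * blockDim + t);     // Idx from the table above
    return idx;
}
```

For N = 16 and blockDim = 4, block 1 covers global indices 4–7, exactly as the table shows.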
(13)
Vector Addition
CPU Program:

void vector_add(float *a, float *b, float *c, int N) {
    for (int index = 0; index < N; ++index)
        c[index] = a[index] + b[index];
}

int main() {
    vector_add(a, b, c, N);
}

GPU Program (Kernel):

__global__ void vector_add(float *a, float *b, float *c, int N) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N)
        c[index] = a[index] + b[index];
}

int main() {
    dim3 dimBlock(blocksize);
    dim3 dimGrid(N / dimBlock.x);
    vector_add<<<dimGrid, dimBlock>>>(a, b, c, N);
}
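One way to read the GPU launch: it replaces the CPU loop with an iteration space of (blockIdx.x, threadIdx.x) pairs. The sketch below (plain C++ on the host, not real CUDA; `vector_add_sim` is a name introduced here) makes that explicit:

```cpp
// Host-side stand-in for the CUDA launch: the two loops enumerate the same
// (blockIdx.x, threadIdx.x) space that <<<gridDim, blockDim>>> creates, and
// the loop body is the kernel body from the slide. A sketch for intuition,
// not a description of how the hardware actually schedules threads.
void vector_add_sim(const float *a, const float *b, float *c, int N,
                    int gridDim, int blockDim) {
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx) {
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
            int index = blockIdx * blockDim + threadIdx;
            if (index < N)                      // same guard as the kernel
                c[index] = a[index] + b[index];
        }
    }
}
```

The guard matters because the grid may contain more threads than elements; on the GPU those loop iterations run as concurrent threads rather than sequentially.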
(14)
GPU Architecture Basics
[Figure: a CUDA core is a simple in-order pipeline (EX, MEM, WB stages) containing an FP unit and an INT unit; an SM holds many such cores, and the "SI" in SIMT comes from all of them executing the same instruction]
(15)
Execution of a CUDA Program
• Blocks are scheduled and executed independently on SMs
• All blocks share the (global) GPU memory
(16)
Executing a Block of Threads
• Execution unit: warp
  ◦ A group of threads (32 for NVIDIA GPUs)
• Blocks are partitioned into warps with consecutive thread IDs
[Figure: Block 0 and Block 1, 128 threads each, are each partitioned into Warp 0 through Warp 3 and scheduled onto an SM]
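The partitioning rule above is simple integer arithmetic. A minimal sketch (plain C++; the helper names are introduced here, with warpSize being 32 on NVIDIA GPUs):

```cpp
// Warp bookkeeping: blocks are split into warps of warpSize consecutive
// thread IDs, so a thread's warp is just its ID divided by the warp size.
int warp_of(int threadId, int warpSize) {
    return threadId / warpSize;                   // consecutive IDs share a warp
}

int warps_per_block(int blockSize, int warpSize) {
    return (blockSize + warpSize - 1) / warpSize; // round up for a partial warp
}
```

For the figure's 128-thread blocks this gives 4 warps per block, and threads 0–31 land in warp 0, threads 32–63 in warp 1, and so on.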
(17)
Warp Execution

• A warp executes one common instruction at a time
• Threads in a warp are mapped to CUDA cores
• Warps are switched and executed on the SM

[Figure: three warps on an SM, each issuing one instruction (Inst 1, Inst 2, Inst 3) across its threads in lock step]
(18)
Handling Branches
• CUDA code:

  if (…) …   // true for some threads
  else …     // true for the others

• What if threads take different branches?
• Branch divergence!

[Figure: a warp splitting into taken and not-taken lanes]
(19)
Branch Divergence
• Occurs within a warp
• The different branch paths are serialized, and all of them are executed
  ◦ Performance issue: low warp utilization

[Figure: during if(…){…} the not-taken lanes idle; during else{…} the taken lanes idle. Idle threads on both paths]
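The utilization loss can be quantified with a simplified model (plain C++; `warp_utilization` is a name introduced here, and the model assumes both branch paths are one instruction long, which is not a hardware specification):

```cpp
#include <vector>

// Warp utilization for one if/else: with divergence the hardware runs the
// taken pass and the not-taken pass back to back, so lane-slots are issued
// for both passes but each lane does useful work in only one of them.
// pred[i] is lane i's branch condition.
double warp_utilization(const std::vector<bool>& pred) {
    int taken = 0;
    for (bool p : pred) if (p) ++taken;
    int notTaken = static_cast<int>(pred.size()) - taken;
    int passes = (taken > 0) + (notTaken > 0);    // serialized passes executed
    // Active lane-slots divided by issued lane-slots across all passes.
    return static_cast<double>(taken + notTaken) /
           (static_cast<double>(passes) * pred.size());
}
```

A warp whose lanes all agree gets 1.0; any mix of taken and not-taken lanes drops it to 0.5 under this equal-path-length assumption.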
(20)
Vector Addition
• N = 60
• 64 Threads, 1 block
• Q: Is there any branch divergence? In which warp?
__global__ void vector_add(float *a, float *b, float *c, int N) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N)
        c[index] = a[index] + b[index];
}
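Questions like this can be answered mechanically: a warp diverges at the guard exactly when `index < N` is true for some of its lanes and false for others. A sketch (plain C++; `divergent_warps` is a name introduced here, assuming a single block so that index equals the thread ID):

```cpp
#include <vector>

// Returns the warps in which the guard (index < N) is true for some lanes
// and false for others, i.e. the warps that diverge at the if in vector_add.
std::vector<int> divergent_warps(int numThreads, int N, int warpSize) {
    std::vector<int> result;
    for (int w = 0; w * warpSize < numThreads; ++w) {
        int active = 0, lanes = 0;
        for (int t = w * warpSize; t < numThreads && t < (w + 1) * warpSize; ++t) {
            ++lanes;
            if (t < N) ++active;        // index == t with a single block
        }
        if (active != 0 && active != lanes)
            result.push_back(w);        // mixed predicate -> divergence
    }
    return result;
}
```

For N = 60 with 64 threads and a warp size of 32, warp 0 (threads 0–31) is uniform, while warp 1 (threads 32–63) mixes 28 active lanes with 4 inactive ones.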
(21)
Example: VectorAdd on GPU
CUDA:

__global__ void vector_add(float *a, float *b, float *c, int N) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N)
        c[index] = a[index] + b[index];
}

PTX (assembly):

    setp.lt.s32 %p, %r5, %rd4;     // r5 = index, rd4 = N
@p  bra L1;
    bra L2;
L1: ld.global.f32 %f1, [%r6];      // r6 = &a[index]
    ld.global.f32 %f2, [%r7];      // r7 = &b[index]
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%r8], %f3;      // r8 = &c[index]
L2: ret;
(22)
Example: VectorAdd on GPU
• N = 8, 8 threads, 1 block, warp size = 4
• 1 SM, 4 cores
• Pipeline:
  ◦ Fetch:
    – One instruction from each warp
    – Round-robin through all warps
  ◦ Execution:
    – In-order execution within warps
    – With proper data forwarding
    – 1 cycle per stage
• How many warps?
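The warp count and the fetch schedule both follow directly from the setup above. A sketch (plain C++; `num_warps` and `fetch_order` are names introduced here):

```cpp
#include <string>
#include <vector>

// Number of warps needed for a block of numThreads threads.
int num_warps(int numThreads, int warpSize) {
    return (numThreads + warpSize - 1) / warpSize;
}

// Round-robin fetch order used in the following slides: the front end
// fetches one instruction per cycle, alternating across the warps.
std::vector<std::string> fetch_order(const std::vector<std::string>& insts,
                                     int warps) {
    std::vector<std::string> order;
    for (const std::string& inst : insts)
        for (int w = 0; w < warps; ++w)          // W0, W1, W0, W1, ...
            order.push_back(inst + " W" + std::to_string(w));
    return order;
}
```

With 8 threads and a warp size of 4 there are 2 warps, so the fetch stream interleaves them: setp W0, setp W1, then the next instruction for W0, and so on.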
(23–44)
Execution Sequence

    setp.lt.s32 %p, %r5, %rd4;
@p  bra L1;
    bra L2;
L1: ld.global.f32 %f1, [%r6];
    ld.global.f32 %f2, [%r7];
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%r8], %f3;
L2: ret;

[Pipeline diagrams, one per cycle: the SM has shared FE and DE stages feeding four cores, each with its own EXE, MEM, and WB stage. Warp0 and Warp1 alternate in fetch, one instruction per cycle: setp W0, setp W1, @p bra W0, @p bra W1 (the fall-through bra L2 is fetched and then discarded once each branch resolves taken), ld W0, ld W1, ld W0, ld W1, add W0, add W1, st W0, st W1, ret W0, ret W1. Each warp instruction advances one stage per cycle and occupies all four cores' EXE/MEM/WB lanes in lock step, until both warps retire and the pipeline drains.]
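Under the stated assumptions (one fetch per cycle, 1 cycle per stage, forwarding, no stalls), the end-to-end cycle count has a closed form. A small model (plain C++; `total_cycles` is a name introduced here, and it ignores the squashed fall-through fetches the diagrams show):

```cpp
// Cycle count for the in-order, single-fetch-per-cycle pipeline on the
// slides (FE, DE, EXE, MEM, WB). One instruction enters the pipeline each
// cycle, so the last of warps * instsPerWarp instructions finishes WB
// (depth - 1) cycles after it is fetched.
int total_cycles(int warps, int instsPerWarp, int depth) {
    return warps * instsPerWarp + depth - 1;
}
```

For the example, each warp executes 7 instructions on the taken path (setp, @p bra, ld, ld, add, st, ret), so 2 warps through a 5-stage pipeline take 2 × 7 + 4 = 18 cycles under this model.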
(45)
Study Guide
• Be able to define the terms thread block, warp, and SIMT, with examples
• Understand the vector addition example in enough detail to
  ◦ Know what operations are in each core at any cycle
  ◦ Given the number of pipeline stages in each core, know how many warps are required to fill the pipelines
  ◦ Know how many instructions are executed in total
• Key differences between fused and discrete GPUs
(46)
Glossary
• CUDA
• Branch divergence
• Kernel
• OpenCL
• Streaming Multiprocessor
• Thread block
• Warp