GPU Memory II ─ Memory Hardware and Bank Conflict
Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes
CUDA Device Memory Space: Review

[Figure: CUDA device memory spaces — per-thread registers and local memory, per-block shared memory, and per-grid global, constant, and texture memory, with the host connected to global, constant, and texture memory.]

Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory

The host can R/W global, constant, and texture memories.
Parallel Memory Sharing

[Figure: a thread with its local memory, a block with its shared memory, and sequential grids in time sharing global memory.]

Local Memory: per-thread
- Private to each thread
- Auto variables, register spill

Shared Memory: per-block
- Shared by threads of the same block
- Inter-thread communication

Global Memory: per-application
- Shared by all threads
- Inter-grid communication
Hardware Overview

[Figure: the Streaming Processor Array is built from Thread Processor Clusters (TPCs); each TPC contains a texture unit (TEX) and Streaming Multiprocessors (SMs); each SM contains instruction fetch/dispatch, an instruction L1 and a data L1 cache, shared memory, eight streaming processors (SPs), and two special function units (SFUs).]
Register File

[Figure: SM pipeline — instruction L1 cache, multithreaded instruction buffer, register file (RF), constant L1 cache, shared memory, operand select, and the MAD and SFU execution units.]

Register File (RF)
- 32 KB per SM
- Provides 4 operands/clock

The texture pipe can also read/write the RF
- 2 SMs share 1 TEX

The load/store pipe can also read/write the RF
Programmer View of Register File

[Figure: the register file dynamically partitioned among 4 blocks vs. 3 blocks.]

- There are 8192 registers in each SM in G80
- Registers are dynamically partitioned across all blocks assigned to the SM
- Once assigned to a block, a register is NOT accessible by threads in other blocks
- Each thread in the same block only accesses registers assigned to itself
Matrix Multiplication Example

If each block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM?
- Each block requires 10*256 = 2560 registers
- 8192 = 3 * 2560 + change
- So three blocks can run on an SM, as far as registers are concerned

What if each thread increases its register use by 1?
- Each block now requires 11*256 = 2816 registers
- 8192 < 2816 * 3
- Only two blocks can run on an SM: a 1/3 reduction of parallelism!
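The arithmetic above can be checked with a few lines of host code. This is only an illustrative sketch: the 8192-register and 16x16-thread figures are the G80 numbers assumed in this example.

    #include <stdio.h>

    int main(void)
    {
        const int regsPerSM       = 8192;      // G80 register file size per SM (in registers)
        const int threadsPerBlock = 16 * 16;   // 256 threads per block, as in the example above

        for (int regsPerThread = 10; regsPerThread <= 11; ++regsPerThread) {
            int regsPerBlock = regsPerThread * threadsPerBlock;
            int blocksPerSM  = regsPerSM / regsPerBlock;   // integer division: blocks that fit
            printf("%d regs/thread -> %d regs/block -> %d block(s) per SM\n",
                   regsPerThread, regsPerBlock, blocksPerSM);
        }
        return 0;   // prints 3 blocks at 10 regs/thread, 2 blocks at 11 regs/thread
    }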
More on Dynamic Partitioning

- Dynamic partitioning gives more flexibility to compilers/programmers
- One can run a smaller number of threads that require many registers each, or a larger number of threads that require few registers each
  - This allows for finer-grain threading than traditional CPU threading models
- The compiler can trade off between instruction-level parallelism and thread-level parallelism
Constant

[Figure: SM pipeline with the constant L1 cache (C$ L1) highlighted.]

- Immediate address constants
- Indexed address constants
- Constants are stored in DRAM and cached on chip
  - One L1 constant cache per SM
- A constant value can be broadcast to all threads in a warp
  - Extremely efficient way of accessing a value that is common for all threads in a block!
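As a small illustration of the broadcast behavior described above, the sketch below (the names coeffs, scale, and hostCoeffs are made up for this example) places a table in constant memory and has every thread of a block read the same word:

    __constant__ float coeffs[16];                      // stored in DRAM, cached in the per-SM constant L1

    __global__ void scale(float* out, const float* in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = coeffs[0] * in[i];                 // same word for every thread in the warp: one broadcast
    }

    // Host side (sketch): copy the table into constant memory before launching the kernel.
    // cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(coeffs));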
Shared Memory

[Figure: SM pipeline with the shared memory highlighted.]

- Each multiprocessor has 16 KB of shared memory
  - 16 banks of 32-bit words
  - Access patterns are discussed later
- Visible to all threads in a thread block
  - Read and write access
Matrix Multiplication Example

Explore a tile-based implementation with shared memory.

Questions:
- How is shared memory organized?
- What are the issues when accessing shared memory?
Tile-Based Multiplication

[Figure: matrices M, N, and P partitioned into BLOCK_SIZE x BLOCK_SIZE tiles; block (bx, by) computes the sub-matrix Psub, and thread (tx, ty) computes one element of it.]

- One block computes one square sub-matrix Psub of size BLOCK_SIZE
- One thread computes one element of Psub
- Assume that the dimensions of M and N are multiples of BLOCK_SIZE and square in shape
Tiled Matrix Multiplication Kernel -- Review

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
1.   __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
2.   __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

3.   int bx = blockIdx.x;  int by = blockIdx.y;
4.   int tx = threadIdx.x; int ty = threadIdx.y;

5.   int Row = by * TILE_WIDTH + ty;
6.   int Col = bx * TILE_WIDTH + tx;
7.   float Pvalue = 0;

8.   for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
9.      Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
10.     Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
11.     __syncthreads();
12.     for (int k = 0; k < TILE_WIDTH; ++k)
13.        Pvalue += Mds[ty][k] * Nds[k][tx];
14.     __syncthreads();
15.   }
16.   Pd[Row*Width + Col] = Pvalue;
}
Shared Memory Organization

[Figure: shared memory divided into banks 0 through 15.]

Parallel memory architecture:
- Memory is divided into banks
  - Essential to achieve high bandwidth
- Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to a bank result in a bank conflict
  - Conflicting accesses are serialized
Shared Memory Access Issue

[Figure: two conflict-free patterns — threads 0-15 each mapped to a distinct bank 0-15.]

No bank conflicts:
- Linear addressing, stride == 1
- Random 1:1 permutation
Shared Memory Access Issue

[Figure: stride-2 accesses map two threads to each bank; stride-8 accesses map eight threads to each bank.]

- 2-way bank conflicts: linear addressing, stride == 2
- 8-way bank conflicts: linear addressing, stride == 8
How addresses map to banks in CUDA

- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks
  - So bank = (word address) % 16
  - Same as the size of a half-warp
- No bank conflicts between different half-warps, only within a single half-warp
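The mapping above can be written down directly. The helper below is only illustrative (bankOf is not a CUDA function); it takes a 32-bit word index as described on this slide.

    // Successive 32-bit words fall in successive banks, wrapping every 16 words on G80.
    __host__ __device__ inline int bankOf(int wordIndex)   // wordIndex = byteAddress / 4
    {
        return wordIndex % 16;
    }

    // Example: shared[threadIdx.x] puts thread t in bank t % 16,
    // so a half-warp (threads 0..15) touches 16 distinct banks -- conflict-free.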
Shared Memory Performance

- Shared memory is as fast as registers if there are no bank conflicts
- The fast case:
  - If all threads of a half-warp access different banks, there is no bank conflict
  - If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
- The slow case:
  - Bank conflict: multiple threads in the same half-warp access the same bank
  - The accesses must be serialized
  - Cost = max # of simultaneous accesses to a single bank
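The fast and slow cases can be contrasted in a few lines. This is a sketch only, with a made-up kernel name and array size, assuming blockDim.x <= 256:

    __global__ void accessCases(void)
    {
        __shared__ float data[512];

        float a = data[threadIdx.x];        // stride 1: each thread of a half-warp hits a different bank (fast)
        float b = data[0];                  // identical address for all threads: broadcast, no conflict (fast)
        float c = data[2 * threadIdx.x];    // stride 2: threads t and t+8 share a bank -> 2-way conflict (slow)
        (void)a; (void)b; (void)c;          // silence unused-variable warnings in this sketch
    }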
Linear Addressing (1D)

[Figure: access patterns for stride s = 1 and s = 3, both conflict-free.]

Given:

    __shared__ float shared[256];
    float foo = shared[baseIndex + s * threadIdx.x];

This is only bank-conflict-free if s shares no common factors with the number of banks
- 16 on G80, so s must be odd
Data types and bank conflicts

[Figure: 32-bit accesses map one thread per bank; char accesses pack four threads into each bank; short accesses pack two threads into each bank.]

This has no conflicts if the type of shared is 32 bits:

    foo = shared[baseIndex + threadIdx.x];

But not if the data type is smaller.

4-way bank conflicts:

    __shared__ char shared[];
    foo = shared[baseIndex + threadIdx.x];

2-way bank conflicts:

    __shared__ short shared[];
    foo = shared[baseIndex + threadIdx.x];
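One way to sidestep the 4-way char conflict above — a technique not shown in the slides, sketched here with made-up names — is to have each thread load a full 32-bit word and unpack the four chars in registers, so the shared memory access itself stays stride-1:

    __global__ void unpackChars(const int* packedGlobal)
    {
        __shared__ int packed[256];                       // one 32-bit word per thread: stride-1, conflict-free
        int tid = threadIdx.x;                            // assumes blockDim.x == 256

        packed[tid] = packedGlobal[blockIdx.x * blockDim.x + tid];
        __syncthreads();

        int  word = packed[tid];
        char c0 = (char)( word        & 0xFF);            // unpack in registers instead of re-reading
        char c1 = (char)((word >>  8) & 0xFF);            // shared memory byte by byte
        (void)c0; (void)c1;                               // sketch only; real code would use the bytes here
    }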
Structs and Bank Conflicts

Struct assignments compile into as many memory accesses as there are struct members:

    struct vector { float x, y, z; };
    struct myType {
        float f;
        int   c;
    };
    __shared__ struct vector vectors[64];
    __shared__ struct myType myTypes[64];

This has no bank conflicts for vector; the struct size is 3 words
- 3 accesses per thread, contiguous banks (no common factor with 16)

    struct vector v = vectors[baseIndex + threadIdx.x];

This has 2-way bank conflicts for myType (2 accesses per thread):

    struct myType m = myTypes[baseIndex + threadIdx.x];
Common Array Bank Conflict Patterns (1D)

[Figure: with stride-2 loads, threads t and t+8 map to the same bank.]

Each thread loads 2 elements into shared memory:
- 2-way-interleaved loads result in 2-way bank conflicts:

    int tid = threadIdx.x;
    shared[2*tid]   = global[2*tid];
    shared[2*tid+1] = global[2*tid+1];

This makes sense for traditional CPU threads: locality in cache line usage and reduced sharing traffic.
Not in shared memory usage, where there are no cache line effects but there are banking effects.
A Better Array Access Pattern

[Figure: stride-1 loads — threads 0-15 map to banks 0-15, conflict-free.]

Each thread loads one element in every consecutive group of blockDim.x elements:

    shared[tid] = global[tid];
    shared[tid + blockDim.x] = global[tid + blockDim.x];
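In context, the two loads above might sit in a kernel like the sketch below. The kernel name and sizes are illustrative, assuming a 1D block of 256 threads and 2 * blockDim.x elements per block:

    __global__ void loadConflictFree(const float* global)
    {
        __shared__ float shared[512];                        // 2 * blockDim.x, assuming blockDim.x == 256
        int tid  = threadIdx.x;
        int base = blockIdx.x * 2 * blockDim.x;              // this block's window in the global array

        shared[tid]              = global[base + tid];                  // group 0: stride 1, banks 0..15, no conflicts
        shared[tid + blockDim.x] = global[base + tid + blockDim.x];     // group 1: also stride 1, no conflicts
        __syncthreads();
        // ... work on shared[] here ...
    }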
Common Bank Conflict Patterns (2D)

[Figure: bank indices of a 16x16 float array — without padding, every row starts at bank 0; with one float of padding per row, row r starts at bank r.]

Operating on a 2D array of floats in shared memory
- e.g. image processing

Example: 16x16 block
- Each thread processes a row
- So threads in a block access the elements in each column simultaneously
- 16-way bank conflicts: rows all start at bank 0

Solution 1) pad the rows
- Add one float to the end of each row
Solution 2) transpose before processing
- Suffer bank conflicts during the transpose
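A minimal sketch of Solution 1, assuming the 16x16 block from the example (the kernel and array names are made up):

    __global__ void rowPerThread(void)
    {
        __shared__ float tile[16][16 + 1];           // pad each row by one float: row r now starts at bank r % 16

        // Element [r][c] sits at word r*17 + c, so a column access by threads 0..15
        // hits banks 0, 1, 2, ..., 15 instead of all hitting bank 0.
        float v = tile[threadIdx.x][0];              // assumes a 16x16 block, threadIdx.x in 0..15
        (void)v;                                     // sketch only; real code would process the row here
    }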
Does Matrix Multiplication Incur Shared Memory Bank Conflicts?

12.     for (int k = 0; k < TILE_WIDTH; ++k)
13.        Pvalue += Mds[ty][k] * Nds[k][tx];
14.     __syncthreads();
15.   }

[Figure: Md, Nd, and Pd partitioned into TILE_WIDTH x TILE_WIDTH tiles; the loop above reads one row of Mds and one column of Nds per thread.]

In flattened (row-major) form, the two shared memory accesses are:

    Mds[ty*TILE_WIDTH + k]        Nds[k*TILE_WIDTH + tx]
Mds[ty*TILE_WIDTH + k], for TILE_WIDTH = 16:
- The whole half-warp accesses the same shared memory location
- A conflict in principle, but the GPU supports broadcasting, so the access is not serialized
Mds[ty*TILE_WIDTH + k], for TILE_WIDTH = 8:
- The first half-warp and the second half-warp access two different shared memory locations
- 8-way bank conflict
Mds[ty*TILE_WIDTH + k], for TILE_WIDTH = 4:
- 4-way bank conflict
Nds[k*TILE_WIDTH + tx], for TILE_WIDTH = 16:
- Each thread in a half-warp accesses a different shared memory location
- No conflict
Nds[k*TILE_WIDTH + tx], for TILE_WIDTH = 8:
- Since the 2D array is stored in row-major order, this is the same as TILE_WIDTH = 16
- No conflict