GPU Memory II ─ Memory Hardware and Bank Conflict
Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes
CUDA Device Memory Space: Review

[Figure: CUDA device memory spaces — per-thread registers and local memory, per-block shared memory, and per-grid global, constant, and texture memory, with the host connected to global, constant, and texture memory.]

Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory

The host can R/W global, constant, and texture memories.
Parallel Memory Sharing

[Figure: a thread with its local memory, a block with its shared memory, and sequential grids in time sharing global memory.]

Local Memory: per-thread
- Private to each thread
- Auto variables, register spill

Shared Memory: per-block
- Shared by threads of the same block
- Inter-thread communication

Global Memory: per-application
- Shared by all threads
- Inter-grid communication
Hardware Overview

[Figure: the Streaming Processor Array is built from Thread Processor Clusters (TPCs); each TPC contains a texture unit (TEX) and Streaming Multiprocessors (SMs); each SM contains instruction fetch/dispatch, an instruction L1 and a data L1 cache, shared memory, eight streaming processors (SPs), and two special function units (SFUs).]
Register File

[Figure: SM pipeline — instruction L1 cache, multithreaded instruction buffer, register file (RF), constant L1 cache, shared memory, operand select, and the MAD and SFU execution units.]

Register File (RF)
- 32 KB per SM
- Provides 4 operands/clock

The texture pipe can also read/write the RF
- 2 SMs share 1 TEX

The load/store pipe can also read/write the RF
Programmer View of Register File

[Figure: the register file dynamically partitioned among 4 blocks vs. 3 blocks.]

- There are 8192 registers in each SM in G80
- Registers are dynamically partitioned across all blocks assigned to the SM
- Once assigned to a block, a register is NOT accessible by threads in other blocks
- Each thread in the same block only accesses registers assigned to itself
Matrix Multiplication Example

If each block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM?
- Each block requires 10*256 = 2560 registers
- 8192 = 3 * 2560 + change
- So three blocks can run on an SM, as far as registers are concerned

What if each thread increases its register use by 1?
- Each block now requires 11*256 = 2816 registers
- 8192 < 2816 * 3
- Only two blocks can run on an SM: a 1/3 reduction of parallelism!
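The arithmetic above can be checked with a few lines of host code. This is only an illustrative sketch: the 8192-register and 16x16-thread figures are the G80 numbers assumed in this example.

    #include <stdio.h>

    int main(void)
    {
        const int regsPerSM       = 8192;      // G80 register file size per SM (in registers)
        const int threadsPerBlock = 16 * 16;   // 256 threads per block, as in the example above

        for (int regsPerThread = 10; regsPerThread <= 11; ++regsPerThread) {
            int regsPerBlock = regsPerThread * threadsPerBlock;
            int blocksPerSM  = regsPerSM / regsPerBlock;   // integer division: blocks that fit
            printf("%d regs/thread -> %d regs/block -> %d block(s) per SM\n",
                   regsPerThread, regsPerBlock, blocksPerSM);
        }
        return 0;   // prints 3 blocks at 10 regs/thread, 2 blocks at 11 regs/thread
    }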
More on Dynamic Partitioning

- Dynamic partitioning gives more flexibility to compilers/programmers
- One can run a smaller number of threads that require many registers each, or a larger number of threads that require few registers each
  - This allows for finer-grain threading than traditional CPU threading models
- The compiler can trade off between instruction-level parallelism and thread-level parallelism
Constant

[Figure: SM pipeline with the constant L1 cache (C$ L1) highlighted.]

- Immediate address constants
- Indexed address constants
- Constants are stored in DRAM and cached on chip
  - One L1 constant cache per SM
- A constant value can be broadcast to all threads in a warp
  - Extremely efficient way of accessing a value that is common for all threads in a block!
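As a small illustration of the broadcast behavior described above, the sketch below (the names coeffs, scale, and hostCoeffs are made up for this example) places a table in constant memory and has every thread of a block read the same word:

    __constant__ float coeffs[16];                      // stored in DRAM, cached in the per-SM constant L1

    __global__ void scale(float* out, const float* in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = coeffs[0] * in[i];                 // same word for every thread in the warp: one broadcast
    }

    // Host side (sketch): copy the table into constant memory before launching the kernel.
    // cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(coeffs));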
Shared Memory

[Figure: SM pipeline with the shared memory highlighted.]

- Each multiprocessor has 16 KB of shared memory
  - 16 banks of 32-bit words
  - Access patterns are discussed later
- Visible to all threads in a thread block
  - Read and write access
Matrix Multiplication Example

Explore a tile-based implementation with shared memory.

Questions:
- How is shared memory organized?
- What are the issues when accessing shared memory?
Tile-Based Multiplication

[Figure: matrices M, N, and P partitioned into BLOCK_SIZE x BLOCK_SIZE tiles; block (bx, by) computes the sub-matrix Psub, and thread (tx, ty) computes one element of it.]

- One block computes one square sub-matrix Psub of size BLOCK_SIZE
- One thread computes one element of Psub
- Assume that the dimensions of M and N are multiples of BLOCK_SIZE and square in shape
Tiled Matrix Multiplication Kernel -- Review

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
1.   __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
2.   __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

3.   int bx = blockIdx.x;  int by = blockIdx.y;
4.   int tx = threadIdx.x; int ty = threadIdx.y;

5.   int Row = by * TILE_WIDTH + ty;
6.   int Col = bx * TILE_WIDTH + tx;
7.   float Pvalue = 0;

8.   for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
9.      Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
10.     Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
11.     __syncthreads();
12.     for (int k = 0; k < TILE_WIDTH; ++k)
13.        Pvalue += Mds[ty][k] * Nds[k][tx];
14.     __syncthreads();
15.   }
16.   Pd[Row*Width + Col] = Pvalue;
}
Shared Memory Organization

[Figure: shared memory divided into banks 0 through 15.]

Parallel memory architecture:
- Memory is divided into banks
  - Essential to achieve high bandwidth
- Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to a bank result in a bank conflict
  - Conflicting accesses are serialized
Shared Memory Access Issue

[Figure: two conflict-free patterns — threads 0-15 each mapped to a distinct bank 0-15.]

No bank conflicts:
- Linear addressing, stride == 1
- Random 1:1 permutation
Shared Memory Access Issue

[Figure: stride-2 accesses map two threads to each bank; stride-8 accesses map eight threads to each bank.]

- 2-way bank conflicts: linear addressing, stride == 2
- 8-way bank conflicts: linear addressing, stride == 8
How addresses map to banks in CUDA

- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks
  - So bank = (word address) % 16
  - Same as the size of a half-warp
- No bank conflicts between different half-warps, only within a single half-warp
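The mapping above can be written down directly. The helper below is only illustrative (bankOf is not a CUDA function); it takes a 32-bit word index as described on this slide.

    // Successive 32-bit words fall in successive banks, wrapping every 16 words on G80.
    __host__ __device__ inline int bankOf(int wordIndex)   // wordIndex = byteAddress / 4
    {
        return wordIndex % 16;
    }

    // Example: shared[threadIdx.x] puts thread t in bank t % 16,
    // so a half-warp (threads 0..15) touches 16 distinct banks -- conflict-free.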
Shared Memory Performance

- Shared memory is as fast as registers if there are no bank conflicts
- The fast case:
  - If all threads of a half-warp access different banks, there is no bank conflict
  - If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
- The slow case:
  - Bank conflict: multiple threads in the same half-warp access the same bank
  - The accesses must be serialized
  - Cost = max # of simultaneous accesses to a single bank
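The fast and slow cases can be contrasted in a few lines. This is a sketch only, with a made-up kernel name and array size, assuming blockDim.x <= 256:

    __global__ void accessCases(void)
    {
        __shared__ float data[512];

        float a = data[threadIdx.x];        // stride 1: each thread of a half-warp hits a different bank (fast)
        float b = data[0];                  // identical address for all threads: broadcast, no conflict (fast)
        float c = data[2 * threadIdx.x];    // stride 2: threads t and t+8 share a bank -> 2-way conflict (slow)
        (void)a; (void)b; (void)c;          // silence unused-variable warnings in this sketch
    }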
Linear Addressing (1D)

[Figure: access patterns for stride s = 1 and s = 3, both conflict-free.]

Given:

    __shared__ float shared[256];
    float foo = shared[baseIndex + s * threadIdx.x];

This is only bank-conflict-free if s shares no common factors with the number of banks
- 16 on G80, so s must be odd
Data types and bank conflicts

[Figure: 32-bit accesses map one thread per bank; char accesses pack four threads into each bank; short accesses pack two threads into each bank.]

This has no conflicts if the type of shared is 32 bits:

    foo = shared[baseIndex + threadIdx.x];

But not if the data type is smaller.

4-way bank conflicts:

    __shared__ char shared[];
    foo = shared[baseIndex + threadIdx.x];

2-way bank conflicts:

    __shared__ short shared[];
    foo = shared[baseIndex + threadIdx.x];
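One way to sidestep the 4-way char conflict above — a technique not shown in the slides, sketched here with made-up names — is to have each thread load a full 32-bit word and unpack the four chars in registers, so the shared memory access itself stays stride-1:

    __global__ void unpackChars(const int* packedGlobal)
    {
        __shared__ int packed[256];                       // one 32-bit word per thread: stride-1, conflict-free
        int tid = threadIdx.x;                            // assumes blockDim.x == 256

        packed[tid] = packedGlobal[blockIdx.x * blockDim.x + tid];
        __syncthreads();

        int  word = packed[tid];
        char c0 = (char)( word        & 0xFF);            // unpack in registers instead of re-reading
        char c1 = (char)((word >>  8) & 0xFF);            // shared memory byte by byte
        (void)c0; (void)c1;                               // sketch only; real code would use the bytes here
    }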
Structs and Bank Conflicts

Struct assignments compile into as many memory accesses as there are struct members:

    struct vector { float x, y, z; };
    struct myType {
        float f;
        int   c;
    };
    __shared__ struct vector vectors[64];
    __shared__ struct myType myTypes[64];

This has no bank conflicts for vector; the struct size is 3 words
- 3 accesses per thread, contiguous banks (no common factor with 16)

    struct vector v = vectors[baseIndex + threadIdx.x];

This has 2-way bank conflicts for myType (2 accesses per thread):

    struct myType m = myTypes[baseIndex + threadIdx.x];
Common Array Bank Conflict Patterns (1D)

[Figure: with stride-2 loads, threads t and t+8 map to the same bank.]

Each thread loads 2 elements into shared memory:
- 2-way-interleaved loads result in 2-way bank conflicts:

    int tid = threadIdx.x;
    shared[2*tid]   = global[2*tid];
    shared[2*tid+1] = global[2*tid+1];

This makes sense for traditional CPU threads: locality in cache line usage and reduced sharing traffic.
Not in shared memory usage, where there are no cache line effects but there are banking effects.
A Better Array Access Pattern

[Figure: stride-1 loads — threads 0-15 map to banks 0-15, conflict-free.]

Each thread loads one element in every consecutive group of blockDim.x elements:

    shared[tid] = global[tid];
    shared[tid + blockDim.x] = global[tid + blockDim.x];
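In context, the two loads above might sit in a kernel like the sketch below. The kernel name and sizes are illustrative, assuming a 1D block of 256 threads and 2 * blockDim.x elements per block:

    __global__ void loadConflictFree(const float* global)
    {
        __shared__ float shared[512];                        // 2 * blockDim.x, assuming blockDim.x == 256
        int tid  = threadIdx.x;
        int base = blockIdx.x * 2 * blockDim.x;              // this block's window in the global array

        shared[tid]              = global[base + tid];                  // group 0: stride 1, banks 0..15, no conflicts
        shared[tid + blockDim.x] = global[base + tid + blockDim.x];     // group 1: also stride 1, no conflicts
        __syncthreads();
        // ... work on shared[] here ...
    }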
Common Bank Conflict Patterns (2D)

[Figure: bank indices of a 16x16 float array — without padding, every row starts at bank 0; with one float of padding per row, row r starts at bank r.]

Operating on a 2D array of floats in shared memory
- e.g. image processing

Example: 16x16 block
- Each thread processes a row
- So threads in a block access the elements in each column simultaneously
- 16-way bank conflicts: rows all start at bank 0

Solution 1) pad the rows
- Add one float to the end of each row
Solution 2) transpose before processing
- Suffer bank conflicts during the transpose
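A minimal sketch of Solution 1, assuming the 16x16 block from the example (the kernel and array names are made up):

    __global__ void rowPerThread(void)
    {
        __shared__ float tile[16][16 + 1];           // pad each row by one float: row r now starts at bank r % 16

        // Element [r][c] sits at word r*17 + c, so a column access by threads 0..15
        // hits banks 0, 1, 2, ..., 15 instead of all hitting bank 0.
        float v = tile[threadIdx.x][0];              // assumes a 16x16 block, threadIdx.x in 0..15
        (void)v;                                     // sketch only; real code would process the row here
    }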
Does Matrix Multiplication Incur Shared Memory Bank Conflicts?

12.     for (int k = 0; k < TILE_WIDTH; ++k)
13.        Pvalue += Mds[ty][k] * Nds[k][tx];
14.     __syncthreads();
15.   }

[Figure: Md, Nd, and Pd partitioned into TILE_WIDTH x TILE_WIDTH tiles; the loop above reads one row of Mds and one column of Nds per thread.]

In flattened (row-major) form, the two shared memory accesses are:

    Mds[ty*TILE_WIDTH + k]        Nds[k*TILE_WIDTH + tx]
Mds[ty*TILE_WIDTH + k], for TILE_WIDTH = 16:
- The whole half-warp accesses the same shared memory location
- A conflict in principle, but the GPU supports broadcasting, so the access is not serialized
Mds[ty*TILE_WIDTH + k], for TILE_WIDTH = 8:
- The first half-warp and the second half-warp access two different shared memory locations
- 8-way bank conflict
Mds[ty*TILE_WIDTH + k], for TILE_WIDTH = 4:
- 4-way bank conflict
Nds[k*TILE_WIDTH + tx], for TILE_WIDTH = 16:
- Each thread in a half-warp accesses a different shared memory location
- No conflict
Nds[k*TILE_WIDTH + tx], for TILE_WIDTH = 8:
- Since the 2D array is stored in row-major order, this is the same as TILE_WIDTH = 16
- No conflict