
Page 1: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I


ECE 8823A

GPU Architectures

Module 5: Execution and Resources - I

Page 2: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Reading Assignment

• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6

• CUDA Programming Guide
– http://docs.nvidia.com/cuda/cuda-c-programming-guide/


Page 3: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Objective

• To understand the implications of programming model constructs on demand for execution resources

• To be able to reason about performance consequences of programming model parameters
– Thread blocks, warps, memory behaviors, etc.
– Need deeper understanding of architecture to be really valuable (later)

• To understand DRAM bandwidth
– Cause of the DRAM bandwidth problem
– Programming techniques that address the problem: memory coalescing, corner turning

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012


Page 4: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Formation of Warps


• How do you form warps out of multidimensional arrays of threads?
– Linearize thread IDs

[Figure: a grid of 2x2 thread blocks; Block (1,1) is expanded into its threads, shown both as a 1D thread block and as a 3D thread block, with consecutive threads grouped into a warp.]

Page 5: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Formation of Warps

[Figure: the same grid, with Block (1,1) expanded into a 3D thread block whose threads Thread(0,0,0) ... Thread(1,0,3) are linearized as shown below.]

T0,0,0 T0,0,1 T0,0,2 T0,0,3 T0,1,0 T0,1,1 T0,1,2 T0,1,3 T1,0,0 T1,0,1 T1,0,2 T1,0,3 T1,1,0 T1,1,1 T1,1,2 T1,1,3

linear order (last coordinate varies fastest), for both 2D and 3D thread blocks
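The linearization that produces this order can be written down directly. A minimal device-side sketch, assuming the thread labels in the figure are (z, y, x) with x varying fastest, and using CUDA's built-in warpSize:

    __device__ unsigned int linearThreadId(void)
    {
        // Row-major linearization: x varies fastest, then y, then z
        return threadIdx.z * blockDim.y * blockDim.x
             + threadIdx.y * blockDim.x
             + threadIdx.x;
    }

    __device__ unsigned int warpId(void) { return linearThreadId() / warpSize; }
    __device__ unsigned int laneId(void) { return linearThreadId() % warpSize; }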

Page 6: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Execution of Warps

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012


• Each warp executed as a SIMD bundle
• How do we handle divergent control flow among threads in a warp?
– Execution semantics
– How is it implemented? (later)


Page 7: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Reduction: Approach 1

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012


    extern __shared__ float partialsum[];   // sized via the kernel launch's shared-memory argument
    ...
    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();
        if (t % (2 * stride) == 0)
            partialsum[t] += partialsum[t + stride];
    }

[Figure: reduction tree over a thread block, indexed by threadIdx.x: step 1 forms the pairwise sums 0+1, 2+3, 4+5, 6+7; step 2 forms 0..3 and 4..7; step 3 forms 0..7.]

Page 8: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Reduction: Approach 2

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012


    extern __shared__ float partialsum[];   // sized via the kernel launch's shared-memory argument
    ...
    unsigned int t = threadIdx.x;
    // Stride starts at blockDim.x/2 so that, e.g., threads 0-255 of a 512-thread block are active first
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2)
    {
        __syncthreads();
        if (t < stride)
            partialsum[t] += partialsum[t + stride];
    }

• The difference is in which threads diverge!
• For a thread block of 512 threads
– Threads 0-255 take the branch, threads 256-511 do not
• For a warp size of 32, all threads in a warp then have identical branch outcomes: no divergence!
• When #active threads < warp size, the old divergence problem returns
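A minimal host-side sketch of this boundary effect, assuming Approach 2 as written above, a 512-thread block, and 32-thread warps; it only prints which strides leave a partially active warp:

    #include <stdio.h>

    int main(void)
    {
        const int blockSize = 512, warpSize = 32;
        // A warp is partially active (divergent) only once the active-thread
        // boundary t == stride falls inside a warp, i.e. stride < warpSize.
        for (int stride = blockSize / 2; stride >= 1; stride /= 2)
            printf("stride %3d: %s\n", stride,
                   stride < warpSize ? "divergent warp" : "no divergence");
        return 0;
    }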

Page 9: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Global Memory (DRAM) Bandwidth

[Figure: two pictures contrasting the ideal and the reality of DRAM bandwidth.]

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012


Page 10: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I


©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

DRAM Bank Organization

• Each core array has about 1M bits

• Each bit is stored in a tiny capacitor accessed through one transistor

[Figure: DRAM bank organization: a row address feeds the row decoder, which selects one row of the memory cell core array; the wide row is captured by the sense amps and column latches, and a mux driven by the column address narrows it down to the off-chip data pin interface.]

Page 11: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

A very small (8x2 bit) DRAM Bank

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

[Figure: a very small 8x2-bit DRAM bank: the row-address bits drive a decoder that selects one row of cells, the sense amps capture that row, and a mux selects the addressed bits for output.]

Page 12: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

DRAM core arrays are slow

• Reading from a cell in the core array is a very slow process
– DDR: core speed = ½ interface speed
– DDR2/GDDR3: core speed = ¼ interface speed
– DDR3/GDDR4: core speed = ⅛ interface speed
– ... likely to be worse in the future

[Figure: a single DRAM cell: a very small capacitance stores one data bit, and about 1000 cells are connected to each vertical bit line running to the sense amps.]

Page 13: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

DRAM Bursting

• For DDR{2,3} SDRAM cores clocked at 1/N speed of the interface:

– Load N × (interface width) bits from the same row at once into an internal buffer, then transfer them in N steps at interface speed

– DDR2/GDDR3: buffer width = 4X interface width
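A small worked-number sketch of the buffer sizing; the 64-bit interface width is an assumption, and the 4:1 ratio is the DDR2/GDDR3 case from the slide:

    #include <stdio.h>

    int main(void)
    {
        const int interfaceBits = 64;   // assumed DRAM interface width
        const int N = 4;                // DDR2/GDDR3: core clocked at 1/4 interface speed
        // One core-array access fills a buffer of N x interface width bits,
        // which then drains over N interface cycles (the burst).
        printf("buffer width = %d bits (%dx the interface width)\n",
               N * interfaceBits, N);
        printf("interface cycles per burst = %d\n", N);
        return 0;
    }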

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

13

Page 14: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

DRAM Bursting

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

[Figure: DRAM bursting, step 1: the addressed row is read from the core array into the sense amps.]

Page 15: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

DRAM Bursting

[Figure: DRAM bursting, step 2: the row held in the sense amps/buffer is streamed out through the mux over successive interface cycles.]

“You can buy bandwidth but you can’t bribe God” -- Unknown

Page 16: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

DRAM Bursting for the 8x2 Bank

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

[Figure: non-burst vs. burst timing for the 8x2 bank. Non-burst: each 2-bit transfer to the pins pays the full core-array access delay after the address reaches the decoder. Burst: one core-array access delay is followed by several back-to-back 2-bit transfers.]

Modern DRAM systems are designed to be always accessed in burst mode. Burst bytes are transferred, but discarded when accesses are not to sequential locations.

Page 17: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Multiple DRAM Banks

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

[Figure: two DRAM banks (Bank 0 and Bank 1), each with its own decoder, sense amps, and mux, sharing the data interface.]

Page 18: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

DRAM Bursting for the 8x2 Bank

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

[Figure: single-bank burst timing leaves dead time on the interface between bursts; multi-bank burst timing overlaps one bank's core-array access with another bank's 2-bit transfers, reducing the dead time.]

Page 19: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

First-order Look at the GPU off-chip memory subsystem

• NVIDIA GTX280 GPU
– Peak global memory bandwidth = 141.7 GB/s
• Global memory (GDDR3) interface @ 1.1 GHz
– (Core speed @ 276 MHz)
– For a typical 64-bit interface, we can sustain only about 17.6 GB/s (recall DDR: 2 transfers per clock)
– We need a lot more bandwidth (141.7 GB/s), thus 8 memory channels
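The peak number can be reproduced with first-order arithmetic. A minimal sketch using the slide's figures; the 64-bit per-channel width is the slide's "typical" assumption, and the quoted 141.7 GB/s peak reflects the slightly higher real clock:

    #include <stdio.h>

    int main(void)
    {
        const double ifaceClockGHz    = 1.1;  // GDDR3 interface clock
        const double transfersPerClk  = 2.0;  // DDR: two transfers per clock
        const double bytesPerTransfer = 8.0;  // 64-bit channel
        const int    channels         = 8;

        double perChannel = ifaceClockGHz * transfersPerClk * bytesPerTransfer;
        printf("per channel: %.1f GB/s\n", perChannel);            // ~17.6 GB/s
        printf("8 channels:  %.1f GB/s\n", perChannel * channels); // ~140.8 GB/s
        return 0;
    }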

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012


Page 20: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Multiple Memory Channels

• Divide the memory address space into N parts
– N is the number of memory channels
– Assign each portion to a channel
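A minimal sketch of one way to do the split, assuming low-order interleaving at burst granularity; the burst size and channel count are illustrative, not the GTX280's actual mapping:

    // Map a physical address to a memory channel by interleaving bursts.
    unsigned channelOf(unsigned long long addr,
                       unsigned burstBytes,   // e.g., 64-byte bursts (assumption)
                       unsigned numChannels)  // N memory channels
    {
        // Consecutive bursts land on consecutive channels.
        return (unsigned)((addr / burstBytes) % numChannels);
    }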

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

[Figure: four memory channels (Channel 0 through Channel 3), each with its own DRAM bank(s).]

Page 21: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Memory Controller Organization of a Many-Core Processor

• GTX280: 30 Streaming Multiprocessors (SMs) connected to 8-channel DRAM controllers through an interconnect
– DRAM controllers are interleaved
– Within DRAM controllers (channels), DRAM banks are interleaved for incoming memory requests

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012


Page 22: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Lessons

• Organize data accesses to maximize burst-mode bandwidth
– Access consecutive locations
– Algorithmic strategies + data layout
• Thread blocks issue warp-sized load/store instructions
– 32 addresses in Fermi
– Coalesce these accesses into a smaller number of memory transactions to maximize memory bandwidth
– More later as we discuss microarchitecture


Page 23: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Memory Coalescing

• Memory references from a warp are coalesced into a sequence of memory transactions
– Accesses falling within the same segment are coalesced (e.g., 128-byte segments)

[Figure: a warp issuing LD instructions to consecutive addresses, an opportunity to coalesce (e.g., 16 threads * 4 bytes = 64 contiguous bytes).]
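A minimal CUDA sketch contrasting the two cases (illustrative kernels, not from the slides):

    // Coalesced: consecutive threads of a warp read consecutive 4-byte words,
    // so the warp's loads fall into one or two 128-byte segments.
    __global__ void coalescedCopy(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Strided: consecutive threads read words `stride` elements apart; for a
    // large stride each load touches a different segment (one transaction each).
    __global__ void stridedCopy(const float *in, float *out, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * stride];
    }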

Page 24: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Implications of Memory Coalescing

• Reduce the request rate to L1 and DRAM

• Distinct from CPU optimizations – why?

• Need to be able to re-map entries from each access back to threads


[Figure: an SM (warp schedulers, register file, SP lanes, L1/shared memory) connected to DRAM; coalescing reduces demand on both the L1 access bandwidth and the DRAM access bandwidth.]

Page 25: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Placing a 2D C array into linear memory space

[Figure: a 4x4 array M stored row by row; linearized order in increasing address: M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3]

Page 26: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Base Matrix Multiplication Kernel

    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
    {
        // Calculate the row index of the d_P element and of d_M
        int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        // Calculate the column index of d_P and of d_N
        int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

        float Pvalue = 0;
        // Each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k)
            Pvalue += d_M[Row * Width + k] * d_N[k * Width + Col];

        d_P[Row * Width + Col] = Pvalue;
    }

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012
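A hypothetical host-side launch for this kernel, assuming Width is a multiple of TILE_WIDTH and the device pointers have already been allocated and initialized:

    #define TILE_WIDTH 16   // tile/block edge (assumption)

    void launchMatrixMul(float *d_M, float *d_N, float *d_P, int Width)
    {
        dim3 block(TILE_WIDTH, TILE_WIDTH);
        dim3 grid(Width / TILE_WIDTH, Width / TILE_WIDTH);
        MatrixMulKernel<<<grid, block>>>(d_M, d_N, d_P, Width);
    }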

Page 27: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Two Access Patterns


[Figure: the two access patterns: (a) a thread steps across a row of d_M, index d_M[Row*Width+k]; (b) a thread steps down a column of d_N, index d_N[k*Width+Col]; both matrices are WIDTH x WIDTH.]

k is the loop counter of the inner-product loop in the kernel code

Page 28: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I


N accesses are coalesced.

[Figure: d_N[k*Width+Col] across successive threads in a warp: in load iteration 0, threads T0-T3 touch N0,0 N0,1 N0,2 N0,3; in load iteration 1 they touch N1,0 N1,1 N1,2 N1,3. Each iteration's accesses are to consecutive addresses in the linearized array.]

Page 29: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

M accesses are not coalesced.


[Figure: d_M[Row*Width+k] across successive threads in a warp: in load iteration 0, threads T0-T3 touch M0,0 M1,0 M2,0 M3,0; in load iteration 1 they touch M0,1 M1,1 M2,1 M3,1. Each iteration's accesses are a full row (Width elements) apart in the linearized array.]

Page 30: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Using Shared Memory


[Figure: original access pattern: each thread reads d_M and d_N directly from global memory. Tiled access pattern: threads first copy tiles of d_M and d_N into scratchpad (shared) memory, then perform the multiplication with the scratchpad values.]

Page 31: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Shared Memory Accesses

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

• Shared memory is banked
– No coalescing
• Data access patterns should be structured to avoid bank conflicts
• Low-order interleaved mapping?
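A minimal sketch of the standard padding trick, assuming 32 banks of 4-byte words with low-order interleaving (bank = word index mod 32); the kernel is illustrative, not from the slides:

    #define TILE 32

    __global__ void columnReadDemo(float *out, int k)
    {
        // Without the +1 padding, tile[tx][k] across a warp hits word indices
        // tx*32 + k, which all map to the same bank (32-way conflict).
        // With the padding the indices are tx*33 + k, spread over all 32 banks.
        __shared__ float tile[TILE][TILE + 1];

        int tx = threadIdx.x, ty = threadIdx.y;
        tile[ty][tx] = (float)(ty * TILE + tx);   // row-wise fill: conflict-free
        __syncthreads();

        out[ty * TILE + tx] = tile[tx][k];        // column-style read, conflict-free with padding
    }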

Page 32: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int bx = blockIdx.x;  int by = blockIdx.y;
        int tx = threadIdx.x; int ty = threadIdx.y;

        // Identify the row and column of the d_P element to work on
        int Row = by * TILE_WIDTH + ty;
        int Col = bx * TILE_WIDTH + tx;

        float Pvalue = 0;
        // Loop over the d_M and d_N tiles required to compute the d_P element
        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            // Collaborative loading of d_M and d_N tiles into shared memory
            Mds[ty][tx] = d_M[Row * Width + m * TILE_WIDTH + tx];
            Nds[ty][tx] = d_N[(m * TILE_WIDTH + ty) * Width + Col];
            __syncthreads();

            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += Mds[ty][k] * Nds[k][tx];
            __syncthreads();
        }
        d_P[Row * Width + Col] = Pvalue;
    }

• Accesses from shared memory, hence coalescing is not necessary

• Consider bank conflicts

Page 33: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Coalescing Behavior

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

[Figure: tiled matrix multiplication: d_M, d_N, and d_P are WIDTH x WIDTH; the block computing the Pdsub tile at (Row, Col) steps through d_M tiles starting at column m*TILE_WIDTH and d_N tiles starting at row m*TILE_WIDTH. The collaborative tile loads are made by consecutive threads to consecutive global addresses, so they coalesce.]

Page 34: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Thread Granularity


[Figure: an SM (fetch/decode, warp schedulers, register file, SP lanes, L1/shared memory) connected to DRAM.]

• Consider instruction bandwidth vs. memory bandwidth

• Control amount of work per thread


Page 35: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Thread Granularity Tradeoffs

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

[Figure: the same tiled matrix multiplication layout; horizontally adjacent output tiles of d_P reuse the same tiles of d_M.]

• Preserving instruction bandwidth (memory bandwidth)
– Increase thread granularity
– Merge adjacent tiles: sharing tile data
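A minimal sketch of the merged-tile idea, assuming each thread produces two horizontally adjacent d_P elements, the grid's x dimension is halved accordingly, and Width is a multiple of 2*TILE_WIDTH; this is an illustrative variant, not the course's reference code:

    __global__ void MatrixMulKernel2Tiles(float *d_M, float *d_N, float *d_P, int Width)
    {
        __shared__ float Mds [TILE_WIDTH][TILE_WIDTH];   // shared by both outputs
        __shared__ float Nds0[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds1[TILE_WIDTH][TILE_WIDTH];

        int tx = threadIdx.x, ty = threadIdx.y;
        int Row  = blockIdx.y * TILE_WIDTH + ty;
        int Col0 = (2 * blockIdx.x)     * TILE_WIDTH + tx;   // left output tile
        int Col1 = (2 * blockIdx.x + 1) * TILE_WIDTH + tx;   // right output tile

        float P0 = 0, P1 = 0;
        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            Mds [ty][tx] = d_M[Row * Width + m * TILE_WIDTH + tx];   // loaded once, used twice
            Nds0[ty][tx] = d_N[(m * TILE_WIDTH + ty) * Width + Col0];
            Nds1[ty][tx] = d_N[(m * TILE_WIDTH + ty) * Width + Col1];
            __syncthreads();
            for (int k = 0; k < TILE_WIDTH; ++k) {
                P0 += Mds[ty][k] * Nds0[k][tx];
                P1 += Mds[ty][k] * Nds1[k][tx];
            }
            __syncthreads();
        }
        d_P[Row * Width + Col0] = P0;
        d_P[Row * Width + Col1] = P1;
    }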

Page 36: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

Thread Granularity Tradeoffs (2)

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

[Figure: the same tiled matrix multiplication layout as on the previous slides.]

• Impact on parallelism
– #TBs, #registers/thread
– Need to explore the impact: autotuning

Page 37: 1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

ANY MORE QUESTIONS? READ CHAPTER 6!
