
Module 3: CUDA Execution Model - I (source: ece8823-sy.ece.gatech.edu/.../550/2018/01/CUDA-ExecutionModel-I.pdf)


Page 1


ECE 8823A

GPU Architectures

Module 3: CUDA Execution Model - I

© David Kirk/NVIDIA and Wen-mei Hwu, 2007-2012 ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign

Objective

• A more detailed look at kernel execution
  – Data to thread assignment

• To understand the organization and scheduling of threads
  – Resource assignment at the block level
  – Scheduling at the warp level
  – Basics of SIMT execution

Page 2


Reading Assignment

• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 4

• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6.3

• Reference: CUDA Programming Guide
  – http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract


A Multi-Dimensional Grid Example

[Figure: the host launches Kernel 1 on Grid 1, a 2D arrangement of Blocks (0,0), (0,1), (1,0), (1,1), then Kernel 2 on Grid 2. Block (1,1) of Grid 2 is expanded to show its threads, Thread (0,0,0) through Thread (1,0,3), a 4×2×2 arrangement of 16 threads.]

Page 3


Built-In Variables

• 1D-3D grid of thread blocks
  – Built-in: gridDim
    • gridDim.x, gridDim.y, gridDim.z
  – Built-in: blockDim
    • blockDim.x, blockDim.y, blockDim.z

Example
• dim3 dimGrid(32,2,2) - 3D grid of thread blocks
• dim3 dimGrid(2,2,1) - 2D grid of thread blocks
• dim3 dimGrid(32,1,1) - 1D grid of thread blocks
• dim3 dimGrid(ceil(n/256.0),1,1) - 1D grid of thread blocks

my_kernel<<<dimGrid, dimBlock>>>(..);

Built-In Variables (2)

• 1D-3D grid of threads in a thread block
  – Built-in: blockIdx
    • blockIdx.x, blockIdx.y, blockIdx.z
  – Built-in: threadIdx
    • threadIdx.x, threadIdx.y, threadIdx.z
  – All blocks have the same thread configuration

Example
• dim3 dimBlock(4,2,2) - 3D block of threads
• dim3 dimBlock(2,2,1) - 2D block of threads
• dim3 dimBlock(32,1,1) - 1D block of threads

my_kernel<<<dimGrid, dimBlock>>>(..);

Page 4


Built-In Variables (3)

• 1D-3D grid of threads in a thread block
  – Built-in: blockIdx
    • blockIdx.x, blockIdx.y, blockIdx.z
  – Built-in: threadIdx
    • threadIdx.x, threadIdx.y, threadIdx.z

• Initialized by the runtime through a kernel call
• Range fixed by the compute capability and target devices
  – You can query the device (later)
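Those ranges can be read at run time with cudaGetDeviceProperties; a minimal sketch (error handling omitted, device 0 assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```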

2D Examples

Page 5


Processing a Picture with a 2D Grid

[Figure: a 76×62-pixel picture covered by a grid of 16×16 thread blocks; blocks along the right and bottom edges extend past the picture boundary]

Row-Major Layout in C/C++

[Figure: a 4×4 matrix M with elements M0,0 through M3,3 stored row by row as the linear array M0 through M15]

Row*Width+Col = 2*4+1 = 9 (element M2,1 is M9 in the linear array)

Page 6


Source Code of the Picture Kernel


__global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m) {

  // Calculate the row # of the d_Pin and d_Pout element to process
  int Row = blockIdx.y*blockDim.y + threadIdx.y;

  // Calculate the column # of the d_Pin and d_Pout element to process
  int Col = blockIdx.x*blockDim.x + threadIdx.x;

  // each thread computes one element of d_Pout if in range
  if ((Row < m) && (Col < n)) {
    d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
  }
}

Figure 4.5: Covering a 76×62 picture with 16×16 blocks.

Approach Summary

• Storage layout of data
• Assign unique ID
• Map IDs to data (access)

Page 7


A Simple Running Example: Matrix Multiplication

• A simple illustration of the basic features of memory and thread management in CUDA programs
  – Thread index usage
  – Memory layout
  – Register usage
  – Assume square matrices for simplicity
  – Leave shared memory usage until later

Square Matrix-Matrix Multiplication
• P = M * N of size WIDTH x WIDTH
  – Each thread calculates one element of P
  – Each row of M is loaded WIDTH times from global memory
  – Each column of N is loaded WIDTH times from global memory

[Figure: matrices M, N, and P, each WIDTH x WIDTH]

Page 8



Matrix Multiplication: A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width) {
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}

[Figure: M, N, P, each WIDTH x WIDTH; row i of M and column j of N combine, with k running along the dot product, to produce P(i,j)]

Page 9


Kernel Version: Functional Description

for (int k = 0; k < Width; ++k)
  Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];

[Figure: M, N, P as before; one thread at coordinate (row, col) inside a block, with blockDim.x and blockDim.y marking the block extents]

• Which thread is at coordinate (row, col)?
• Threads self-allocate

Kernel Function - A Small Example

• Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
  – Each has (TILE_WIDTH)^2 threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks

WIDTH = 4; TILE_WIDTH = 2
Each block has 2*2 = 4 threads
WIDTH/TILE_WIDTH = 2; use 2*2 = 4 blocks

[Figure: the 4×4 result matrix P partitioned into four 2×2 tiles, computed by Blocks (0,0), (0,1), (1,0), (1,1)]

© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13

Page 10


A Slightly Bigger Example

WIDTH = 8; TILE_WIDTH = 2
Each block has 2*2 = 4 threads
WIDTH/TILE_WIDTH = 4; use 4*4 = 16 blocks

[Figure: the 8×8 result matrix P, elements P0,0 through P7,7, partitioned into sixteen 2×2 tiles]

A Slightly Bigger Example (cont.)

WIDTH = 8; TILE_WIDTH = 4
Each block has 4*4 = 16 threads
WIDTH/TILE_WIDTH = 2; use 2*2 = 4 blocks

[Figure: the same 8×8 result matrix P partitioned into four 4×4 tiles]

Page 11


Kernel Invocation (Host-side Code)

// Setup the execution configuration
// TILE_WIDTH is a #define constant
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Kernel Function

// Matrix multiplication kernel - per thread code
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {

  // Pvalue is used to store the element of the matrix
  // that is computed by the thread
  float Pvalue = 0;

(continued; the complete kernel appears on page 13)

Page 12


Work for Block (0,0) in a TILE_WIDTH = 2 Configuration

blockIdx.x = 0, blockIdx.y = 0; blockDim.x = 2, blockDim.y = 2
Row = 0 * 2 + threadIdx.y, so Row = 0, 1
Col = 0 * 2 + threadIdx.x, so Col = 0, 1

[Figure: 4×4 matrices M, N, P; the four threads of Block (0,0) compute P0,0, P0,1, P1,0, P1,1, reading rows 0-1 of M and columns 0-1 of N]

Work for Block (0,1)

blockIdx.x = 1, blockIdx.y = 0; blockDim.x = 2, blockDim.y = 2
Row = 0 * 2 + threadIdx.y, so Row = 0, 1
Col = 1 * 2 + threadIdx.x, so Col = 2, 3

[Figure: the four threads of Block (0,1) compute P0,2, P0,3, P1,2, P1,3, reading rows 0-1 of M and columns 2-3 of N]

Page 13


A Simple Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {
  // Calculate the row index of the d_P element and d_M
  int Row = blockIdx.y*blockDim.y + threadIdx.y;
  // Calculate the column index of d_P and d_N
  int Col = blockIdx.x*blockDim.x + threadIdx.x;

  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
      Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
    d_P[Row*Width+Col] = Pvalue;
  }
}
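Putting the pieces together, a minimal host driver for this kernel might look like the sketch below (error checking omitted; the TILE_WIDTH value and function name MatrixMulOnDevice are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>

#define TILE_WIDTH 16

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width);

void MatrixMulOnDevice(float* M, float* N, float* P, int Width) {
    size_t size = Width * Width * sizeof(float);
    float *d_M, *d_N, *d_P;

    // Allocate device memory and copy the inputs over
    cudaMalloc((void**)&d_M, size);
    cudaMemcpy(d_M, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_N, size);
    cudaMemcpy(d_N, N, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_P, size);

    // One thread per P element; round the grid up for Widths
    // that are not multiples of TILE_WIDTH
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
    dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
                 (Width + TILE_WIDTH - 1) / TILE_WIDTH, 1);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);

    // Copy the result back and release device memory
    cudaMemcpy(P, d_P, size, cudaMemcpyDeviceToHost);
    cudaFree(d_M); cudaFree(d_N); cudaFree(d_P);
}
```

The (Row < Width) && (Col < Width) guard in the kernel is what makes the rounded-up grid safe.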

QUESTIONS?

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign