
Module 3: CUDA Execution Model - I (source: ece8823-sy.ece.gatech.edu/.../550/2018/01/CUDA-ExecutionModel-I.pdf)


Page 1


ECE 8823A

GPU Architectures

Module 3: CUDA Execution Model - I

© David Kirk/NVIDIA and Wen-mei Hwu, 2007-2012 ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign

Objective

• A more detailed look at kernel execution
  – Data to thread assignment

• To understand the organization and scheduling of threads
  – Resource assignment at the block level
  – Scheduling at the warp level
  – Basics of SIMT execution

Page 2


Reading Assignment

• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 4

• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6.3

• Reference: CUDA Programming Guide
  – http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract


A Multi-Dimensional Grid Example

[Figure: the host launches Kernel 1 on Grid 1, a 2D arrangement of Blocks (0,0), (0,1), (1,0), (1,1), then Kernel 2 on Grid 2. Block (1,1) of Grid 2 is expanded to show its threads, Thread (0,0,0) through Thread (1,0,3), a 4×2×2 arrangement of 16 threads.]

Page 3


Built-In Variables

• 1D-3D grid of thread blocks
  – Built-in: gridDim
    • gridDim.x, gridDim.y, gridDim.z
  – Built-in: blockDim
    • blockDim.x, blockDim.y, blockDim.z

Example
• dim3 dimGrid(32,2,2) - 3D grid of thread blocks
• dim3 dimGrid(2,2,1) - 2D grid of thread blocks
• dim3 dimGrid(32,1,1) - 1D grid of thread blocks
• dim3 dimGrid(ceil(n/256.0),1,1) - 1D grid of thread blocks

my_kernel<<<dimGrid, dimBlock>>>(..);

Built-In Variables (2)

• 1D-3D grid of threads in a thread block
  – Built-in: blockIdx
    • blockIdx.x, blockIdx.y, blockIdx.z
  – Built-in: threadIdx
    • threadIdx.x, threadIdx.y, threadIdx.z
  – All blocks have the same thread configuration

Example
• dim3 dimBlock(4,2,2) - 3D block of threads
• dim3 dimBlock(2,2,1) - 2D block of threads
• dim3 dimBlock(32,1,1) - 1D block of threads

my_kernel<<<dimGrid, dimBlock>>>(..);

Page 4


Built-In Variables (3)

• 1D-3D grid of threads in a thread block
  – Built-in: blockIdx
    • blockIdx.x, blockIdx.y, blockIdx.z
  – Built-in: threadIdx
    • threadIdx.x, threadIdx.y, threadIdx.z

• Initialized by the runtime through a kernel call
• Range fixed by the compute capability and target devices
  – You can query the device (later)
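Those ranges can be read at run time with cudaGetDeviceProperties; a minimal sketch (error handling omitted, device 0 assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```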

2D Examples

Page 5


Processing a Picture with a 2D Grid

[Figure: a 76×62-pixel picture covered by a grid of 16×16 thread blocks; blocks along the right and bottom edges extend past the picture boundary]

Row-Major Layout in C/C++

[Figure: a 4×4 matrix M with elements M0,0 through M3,3 stored row by row as the linear array M0 through M15]

Row*Width+Col = 2*4+1 = 9 (element M2,1 is M9 in the linear array)

Page 6


Source Code of the Picture Kernel


__global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m) {

  // Calculate the row # of the d_Pin and d_Pout element to process
  int Row = blockIdx.y*blockDim.y + threadIdx.y;

  // Calculate the column # of the d_Pin and d_Pout element to process
  int Col = blockIdx.x*blockDim.x + threadIdx.x;

  // each thread computes one element of d_Pout if in range
  if ((Row < m) && (Col < n)) {
    d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
  }
}

Figure 4.5: Covering a 76×62 picture with 16×16 blocks.

Approach Summary

• Storage layout of data
• Assign unique ID
• Map IDs to data (access)

Page 7


A Simple Running Example: Matrix Multiplication

• A simple illustration of the basic features of memory and thread management in CUDA programs
  – Thread index usage
  – Memory layout
  – Register usage
  – Assume square matrices for simplicity
  – Leave shared memory usage until later

Square Matrix-Matrix Multiplication
• P = M * N of size WIDTH x WIDTH
  – Each thread calculates one element of P
  – Each row of M is loaded WIDTH times from global memory
  – Each column of N is loaded WIDTH times from global memory

[Figure: matrices M, N, and P, each WIDTH x WIDTH]

Page 8



Matrix Multiplication: A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width) {
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}

[Figure: M, N, P, each WIDTH x WIDTH; row i of M and column j of N combine, with k running along the dot product, to produce P(i,j)]

Page 9


Kernel Version: Functional Description

for (int k = 0; k < Width; ++k)
  Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];

[Figure: M, N, P as before; one thread at coordinate (row, col) inside a block, with blockDim.x and blockDim.y marking the block extents]

• Which thread is at coordinate (row, col)?
• Threads self-allocate

Kernel Function - A Small Example

• Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
  – Each has (TILE_WIDTH)^2 threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks

WIDTH = 4; TILE_WIDTH = 2
Each block has 2*2 = 4 threads
WIDTH/TILE_WIDTH = 2; use 2*2 = 4 blocks

[Figure: the 4×4 result matrix P partitioned into four 2×2 tiles, computed by Blocks (0,0), (0,1), (1,0), (1,1)]

© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13

Page 10


A Slightly Bigger Example

WIDTH = 8; TILE_WIDTH = 2
Each block has 2*2 = 4 threads
WIDTH/TILE_WIDTH = 4; use 4*4 = 16 blocks

[Figure: the 8×8 result matrix P, elements P0,0 through P7,7, partitioned into sixteen 2×2 tiles]

A Slightly Bigger Example (cont.)

WIDTH = 8; TILE_WIDTH = 4
Each block has 4*4 = 16 threads
WIDTH/TILE_WIDTH = 2; use 2*2 = 4 blocks

[Figure: the same 8×8 result matrix P partitioned into four 4×4 tiles]

Page 11


Kernel Invocation (Host-side Code)

// Setup the execution configuration
// TILE_WIDTH is a #define constant
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Kernel Function

// Matrix multiplication kernel - per thread code
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {

  // Pvalue is used to store the element of the matrix
  // that is computed by the thread
  float Pvalue = 0;

(continued; the complete kernel appears on page 13)

Page 12


Work for Block (0,0) in a TILE_WIDTH = 2 Configuration

blockIdx.x = 0, blockIdx.y = 0; blockDim.x = 2, blockDim.y = 2
Row = 0 * 2 + threadIdx.y, so Row = 0, 1
Col = 0 * 2 + threadIdx.x, so Col = 0, 1

[Figure: 4×4 matrices M, N, P; the four threads of Block (0,0) compute P0,0, P0,1, P1,0, P1,1, reading rows 0-1 of M and columns 0-1 of N]

Work for Block (0,1)

blockIdx.x = 1, blockIdx.y = 0; blockDim.x = 2, blockDim.y = 2
Row = 0 * 2 + threadIdx.y, so Row = 0, 1
Col = 1 * 2 + threadIdx.x, so Col = 2, 3

[Figure: the four threads of Block (0,1) compute P0,2, P0,3, P1,2, P1,3, reading rows 0-1 of M and columns 2-3 of N]

Page 13


A Simple Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {
  // Calculate the row index of the d_P element and d_M
  int Row = blockIdx.y*blockDim.y + threadIdx.y;
  // Calculate the column index of d_P and d_N
  int Col = blockIdx.x*blockDim.x + threadIdx.x;

  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
      Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
    d_P[Row*Width+Col] = Pvalue;
  }
}
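Putting the pieces together, a minimal host driver for this kernel might look like the sketch below (error checking omitted; the TILE_WIDTH value and function name MatrixMulOnDevice are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>

#define TILE_WIDTH 16

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width);

void MatrixMulOnDevice(float* M, float* N, float* P, int Width) {
    size_t size = Width * Width * sizeof(float);
    float *d_M, *d_N, *d_P;

    // Allocate device memory and copy the inputs over
    cudaMalloc((void**)&d_M, size);
    cudaMemcpy(d_M, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_N, size);
    cudaMemcpy(d_N, N, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_P, size);

    // One thread per P element; round the grid up for Widths
    // that are not multiples of TILE_WIDTH
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
    dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
                 (Width + TILE_WIDTH - 1) / TILE_WIDTH, 1);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);

    // Copy the result back and release device memory
    cudaMemcpy(P, d_P, size, cudaMemcpyDeviceToHost);
    cudaFree(d_M); cudaFree(d_N); cudaFree(d_P);
}
```

The (Row < Width) && (Col < Width) guard in the kernel is what makes the rounded-up grid safe.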

QUESTIONS?

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign