GPU Programming
Lecture 1: Introduction
Miaoqing Huang, University of Arkansas

Source: mqhuang/courses/4643/s2021/lecture/GPU_Lecture_1.pdf


Page 1

GPU Programming

Lecture 1: Introduction

Miaoqing Huang, University of Arkansas

Page 2

Outline

Course Introduction

GPUs as Parallel Computers
  Trend and Design Philosophies
  Programming and Execution Model

Programming GPUs using Nvidia CUDA

Page 3

Outline

Course Introduction

GPUs as Parallel Computers
  Trend and Design Philosophies
  Programming and Execution Model

Programming GPUs using Nvidia CUDA

Page 4

Course Objective

▶ GPU programming
▶ Learn how to program massively parallel processors and achieve
  ▶ High performance
  ▶ Scalability across future generations
▶ Acquire technical knowledge required to achieve the above goals
  ▶ Principles and patterns of parallel programming
  ▶ Processor architecture features and constraints
  ▶ Programming APIs, tools, and techniques

Page 5

Programming Assignments, Projects, and Course Grading

▶ Constituent components of course grading
  ▶ ~10 programming assignments: 60%
  ▶ Quizzes: 10%
    ▶ Given frequently, approximately once a week
  ▶ Final exam: 30%
▶ All programming assignments are to be carried out individually

Page 6

Outline

Course Introduction

GPUs as Parallel Computers
  Trend and Design Philosophies
  Programming and Execution Model

Programming GPUs using Nvidia CUDA

Page 7

Why Massively Parallel Processors?

▶ A quiet revolution and potential build-up
  ▶ Performance advantages over multicore CPUs
    ▶ GFLOPS: 1,000 vs. 100 (in 2009)
    ▶ Memory bandwidth (GB/s): 200 vs. 20
  ▶ A GPU is in every PC and workstation: massive volume and potential impact

Page 8

Different Design Philosophies

[Figure: a CPU devotes its die area to a few complex ALUs, a large control unit, and a big cache in front of DRAM; a GPU devotes it to many simple ALUs in front of DRAM.]

▶ CPU: sequential execution
  ▶ Multiple complicated ALU designs
  ▶ Complicated control logic, e.g., branch prediction
  ▶ Big cache
▶ GPU: parallel computing
  ▶ Many simple processing cores
  ▶ Simple control and scheduling logic
  ▶ No or small cache

Page 9

Architecture of Fermi GPU

[Figure: a Fermi streaming multiprocessor (SM) contains 32 cores, 16 load/store (LD/ST) units, 4 Special Function Units, a register file (32,768 × 32-bit), two warp schedulers each paired with a dispatch unit, an instruction cache, an interconnect network, and 64 KB of shared memory / L1 cache. A 16-SM Fermi GPU adds a shared L2 cache, six DRAM interfaces, the GigaThread engine, and a host interface. Each CUDA core contains a dispatch port, an operand collector, an FP unit, an INT unit, and a result queue.]

▶ 512 Fermi streaming processors in 16 streaming multiprocessors

Page 10

Architecture of Kepler GPU

▶ 2,880 streaming processors in 15 streaming multiprocessors

Page 11

Architecture of Kepler Streaming Multiprocessor

▶ Each streaming multiprocessor contains 192 single-precision cores and 64 double-precision cores

Page 12

Basic Programming Model on GPU

(From the CUDA C Programming Guide Version 3.1, Chapter 2: Programming Model)

The number of threads per block and the number of blocks per grid specified in the <<<…>>> syntax can be of type int or dim3. Two-dimensional blocks or grids can be specified as in the example above.

Each block within the grid can be identified by a one-dimensional or two-dimensional index accessible within the kernel through the built-in blockIdx variable. The dimension of the thread block is accessible within the kernel through the built-in blockDim variable.

Extending the previous MatAdd() example to handle multiple blocks, the code becomes as follows:

    // Kernel definition
    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < N && j < N)
            C[i][j] = A[i][j] + B[i][j];
    }

[Figure 2-1: Grid of Thread Blocks. A grid of blocks (0,0) through (2,1), each block containing threads (0,0) through (3,2).]

▶ Issue hundreds of thousands of threads targeting thousands of processors

Page 13

Execution Model on GPU

(Adapted from "Beyond Programmable Shading: Fundamentals," slide 32)

[Figure: hiding shader stalls over time (clocks). A single core with one fetch/decode unit and eight ALUs runs fragments 1 … 8; on-chip storage holds multiple execution contexts (Ctx) alongside shared context data.]

Page 14

Execution Model on GPU

(Adapted from "Beyond Programmable Shading: Fundamentals," slide 33)

[Figure: the core keeps four fragment groups resident (Frag 1 … 8, 9 … 16, 17 … 24, 25 … 32), each with its own context (1–4), sharing the fetch/decode unit and eight ALUs over time.]

Page 15

Execution Model on GPU

(Adapted from "Beyond Programmable Shading: Fundamentals," slide 34)

[Figure: when the running group stalls, the core switches to a runnable group; each of the four fragment groups alternates between stalled and runnable states over time.]

Page 16

Execution Model on GPU

(Adapted from "Beyond Programmable Shading: Fundamentals," slide 35)

[Figure: as on the previous slide; stalled groups wait while a runnable group executes on the ALUs.]

Page 17

Execution Model on GPU

(Adapted from "Beyond Programmable Shading: Fundamentals," slide 36)

[Figure: all four groups (Frag 1 … 8 through Frag 25 … 32) interleave; each stalls in turn, and whenever one group stalls, another runnable group keeps the ALUs busy.]

Page 18

Execution Model on GPU

(Adapted from "Beyond Programmable Shading: Fundamentals," slide 37)

Throughput!

[Figure: groups 1–4 each start, run, stall, become runnable again, and finish (Done!) at staggered times on the shared ALUs.]

▶ Increase the run time of one group to maximize the throughput of many groups

Page 19

How to deal with branches?

(Adapted from "Beyond Programmable Shading: Fundamentals," slide 25)

[Figure: eight ALUs (ALU 1, ALU 2, …, ALU 8) execute fragments 1 … 8 in lockstep through the following shader:]

    <unconditional shader code>
    if (x > 0) {
        y = pow(x, exp);
        y *= Ks;
        refl = y + Ka;
    } else {
        x = 0;
        refl = Ka;
    }
    <resume unconditional shader code>

Page 20

How to deal with branches?

(Adapted from "Beyond Programmable Shading: Fundamentals," slide 26)

[Figure: the same shader as on the previous slide, now annotated with each lane's branch outcome: T T T F F F F F. Lanes 1–3 take the if side; lanes 4–8 take the else side.]

Page 21

How to deal with branches?

(Adapted from "Beyond Programmable Shading: Fundamentals," slide 27)

[Figure: with lane mask T T T F F F F F, both sides of the branch are issued to all eight ALUs, and lanes on the side they did not take are masked off.]

▶ Not all ALUs do useful work! Worst case: 1/8 performance

Page 22

How to deal with branches?

(Adapted from "Beyond Programmable Shading: Fundamentals," slide 28)

[Figure: the same shader and lane mask T T T F F F F F; after both sides complete, all lanes resume the unconditional shader code together.]

Page 23

Partition of an Application

[Figure: an application's execution divides into sequential portions, traditionally covered by the CPU, and parallel portions, covered by the GPU, with obstacles in between.]

▶ Increase the data-parallel portion of an application
  ▶ Analyze an existing application
  ▶ Expand the data volume of the parallel part

Page 24

Outline

Course Introduction

GPUs as Parallel Computers
  Trend and Design Philosophies
  Programming and Execution Model

Programming GPUs using Nvidia CUDA

Page 25

Nvidia CUDA

▶ CUDA driver
  ▶ Handles the communication with Nvidia GPUs
▶ CUDA toolkit
  ▶ Contains the tools needed to compile and build a CUDA application
▶ CUDA SDK
  ▶ Includes sample projects that provide source code and other resources for constructing CUDA programs

Page 26

Access GPU Cluster

▶ Open a web browser and type "pinnacle-portal.uark.edu" in the address bar
▶ In the popup window, enter your uark ID and password
▶ When you are on the dashboard page of "pinnacle-portal.uark.edu":
  ▶ Click on Clusters→Karpinski Shell Access to open a shell
  ▶ In the shell, type gpu-node4me.sh to allocate a compute node for 2 hours of programming
  ▶ Once you finish using the compute node, type exit to release it
  ▶ The shell supports "Ctrl-V" for pasting; text you select is copied automatically
  ▶ Click on Files→/karpinski/home-folder/ to open a new webpage for downloading/uploading files
▶ Off-campus access to pinnacle-portal
  ▶ Install and run the GlobalProtect VPN first
  ▶ https://its.uark.edu/network-access/vpn/index.php
