Lecture 22: Data Level Parallelism -- Graphical Processing Unit (GPU) and Loop-Level Parallelism

CSCE 513 Computer Architecture
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
https://passlab.github.io/CSCE513

1


Topics for Data Level Parallelism (DLP)

• Parallelism (centered around …)
  – Instruction Level Parallelism
  – Data Level Parallelism
  – Thread Level Parallelism
• DLP Introduction and Vector Architecture – 4.1, 4.2
• SIMD Instruction Set Extensions for Multimedia – 4.3
• Graphical Processing Units (GPU) – 4.4
• GPU and Loop-Level Parallelism and Others – 4.4, 4.5

Computer Graphics

GPU: Graphics Processing Unit

3

Graphics Processing Unit (GPU)

4

Image: http://www.ntu.edu.sg/home/ehchua/programming/opengl/CG_BasicsTheory.html

Recent GPU Architecture

• Unified Scalar Shader Architecture
• Highly Data Parallel Stream Processing

5

An Introduction to Modern GPU Architecture, Ashu Rege, NVIDIA Director of Developer Technology
ftp://download.nvidia.com/developer/cuda/seminar/TDCI_Arch.pdf

Image: http://www.ntu.edu.sg/home/ehchua/programming/opengl/CG_BasicsTheory.html

Unified Shader Architecture

6

FIGURE A.2.5 Basic unified GPU architecture. Example GPU with 112 streaming processor (SP) cores organized in 14 streaming multiprocessors (SMs); the cores are highly multithreaded. It has the basic Tesla architecture of an NVIDIA GeForce 8800. The processors connect with four 64-bit-wide DRAM partitions via an interconnection network. Each SM has eight SP cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit, and a shared memory. Copyright © 2009 Elsevier, Inc. All rights reserved.

GPU Today

• It is a processor optimized for 2D/3D graphics, video, visual computing, and display.
• It is a highly parallel, highly multithreaded multiprocessor optimized for visual computing.
• It provides real-time visual interaction with computed objects via graphics images and video.
• It serves as both a programmable graphics processor and a scalable parallel computing platform.
  – Heterogeneous systems: combine a GPU with a CPU
• It is called a many-core processor.

7

Latest NVIDIA Volta GV100 GPU

• Released May 2017
  – 84 Stream Multiprocessors (SMs) in total
• Cores
  – 5120 FP32 cores (can also do FP16)
  – 2560 FP64 cores
  – 640 Tensor cores
• Memory
  – 16 GB HBM2
  – L2: 6144 KB
  – Shared memory: 96 KB * 80 (SMs)
  – Register file: 20,480 KB (huge)

8
https://devblogs.nvidia.com/parallelforall/inside-volta/

SM of Volta GPU

9

(Volta SM diagram; specifications as listed on the previous slide.)

SM of Volta GPU

10

(Volta SM diagram, continued.)

GPU Performance Gains Over CPU

11

http://docs.nvidia.com/cuda/cuda-c-programming-guide

GPU Performance Gains Over CPU

12

Programming for NVIDIA GPUs

13

http://docs.nvidia.com/cuda/cuda-c-programming-guide/

CUDA (Compute Unified Device Architecture)

Both an architecture and a programming model
• Architecture and execution model
  – Introduced by NVIDIA in 2007
  – Getting the highest possible execution performance requires understanding of the hardware architecture
• Programming model
  – Small set of extensions to C
  – Enables GPUs to execute programs written in C
  – Within C programs, call SIMT "kernel" routines that are executed on the GPU.

14

CUDA Thread

• Parallelism in Vector/SIMD is the combination of lanes (# of PUs) and vector length
• CUDA thread is a unified term that abstracts the parallelism for both programmers and the GPU execution model
  – Programmer: a CUDA thread performs operations for one data element (think of it this way for now)
    • There could be thousands or millions of threads
  – A CUDA thread represents a hardware FU
    • GPU calls it a core (much simpler than a conventional CPU core)
• Hardware-level parallelism is more explicit

15

CUDA Thread Hierarchy

• Allows flexibility and efficiency in processing 1-D, 2-D, and 3-D data on the GPU.
• Linked to the internal organization.
• Threads in one block execute together.

16

(Figure: a grid of thread blocks; the grid and each block can be 1, 2, or 3 dimensions.)

DAXPY

// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

17

Create a number of threads equal to (or slightly greater than) the number of elements to be processed; each thread runs the same daxpy function.

Each thread finds its element to compute and does the work.

DAXPY with Device Code

__global__ void daxpy( … )

• The CUDA C/C++ keyword __global__ indicates a function that:
  – Runs on the device
  – Is called from host code
• The nvcc compiler separates source code into host and device components
  – Device functions (e.g., daxpy()) are processed by the NVIDIA compiler
  – Host functions (e.g., main()) are processed by the standard host compiler
    • gcc, cl.exe
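As a concrete example of this split, a single source file daxpy.cu holding both the kernel and main() can typically be built in one step, e.g. nvcc daxpy.cu -o daxpy; nvcc forwards the host portion to gcc or cl.exe behind the scenes.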

18

DAXPY with Device Code

daxpy<<<num_blocks, num_threads>>>();

• Triple angle brackets mark a call from host code to device code
  – Also called a "kernel launch"
  – The <<<...>>> parameters specify the thread dimensionality
• That's all that is required to execute a function on the GPU!

19

GPU Computing – Offloading Computation

• The GPU is connected to the CPU by a reasonably fast bus (8 GB/s is typical today): PCIe
• Terminology
  – Host: the CPU and its memory (host memory)
  – Device: the GPU and its memory (device memory)

20

Simple Processing Flow

1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory

(Figure: CPU and GPU connected over the PCI bus.)

23

Offloading Computation

// DAXPY in CUDA (the CUDA kernel)
__global__
void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void) {
  int n = 1024;
  double a = 2.0;
  double *x, *y;     /* host copies of x and y */
  double *d_x, *d_y; /* device copies of x and y */
  int size = n * sizeof(double);

  // Alloc space for host copies and set up values (serial code)
  x = (double *)malloc(size); fill_doubles(x, n);
  y = (double *)malloc(size); fill_doubles(y, n);

  // Alloc space for device copies
  cudaMalloc((void **)&d_x, size);
  cudaMalloc((void **)&d_y, size);

  // Copy to device
  cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, size, cudaMemcpyHostToDevice);

  // Invoke DAXPY with 256 threads per block (parallel execution on GPU)
  int nblocks = (n + 255) / 256;
  daxpy<<<nblocks, 256>>>(n, a, d_x, d_y);

  // Copy result back to host (serial code)
  cudaMemcpy(y, d_y, size, cudaMemcpyDeviceToHost);

  // Cleanup
  free(x); free(y);
  cudaFree(d_x); cudaFree(d_y);
  return 0;
}

24

CUDA Programming Model for NVIDIA GPUs

• The CUDA API is split into:
  – The CUDA Management API
  – The CUDA Kernel API
• The CUDA Management API is for a variety of operations
  – GPU memory allocation, data transfer, execution, resource creation
  – Mostly regular C functions and calls
• The CUDA Kernel API is used to define the computation to be performed by the GPU
  – C extensions

25

CUDA Kernel, i.e. Thread Functions

• A CUDA kernel:
  – Defines the operations to be performed by a single thread on the GPU
  – Just as a C/C++ function defines work to be done on the CPU
  – Syntactically, a kernel looks like C/C++ with some extensions

__global__ void kernel(...) {
  ...
}

  – Every CUDA thread executes the same kernel logic (SIMT)
  – Initially, the only difference between threads are their thread coordinates

26

Programming View: How are CUDA threads organized?

• CUDA thread hierarchy
  – Thread Block = SIMT groups that run concurrently on an SM
    • Can barrier-sync and have shared access to the SM shared memory
  – Grid = all Thread Blocks created by the same kernel launch
    • Shared access to GPU global memory
• Launching a kernel is simple and similar to a function call.
  – kernel name and arguments
  – # of thread blocks/grid and # of threads/block to create:
    kernel<<<nblocks, threads_per_block>>>(arg1, arg2, ...);

27

How are CUDA threads organized?

• Threads can be configured in one-, two-, or three-dimensional layouts
  – One-dimensional blocks and grids:
    int nblocks = 4;
    int threads_per_block = 8;
    kernel<<<nblocks, threads_per_block>>>(...);

28

Block 0   Block 1   Block 2   Block 3

How are CUDA threads organized?

• Threads can be configured in one-, two-, or three-dimensional layouts
  – Two-dimensional blocks and grids:
    dim3 nblocks(2, 2);
    dim3 threads_per_block(4, 2);
    kernel<<<nblocks, threads_per_block>>>(...);

29

How are CUDA threads organized?

• Threads can be configured in one-, two-, or three-dimensional layouts
  – Two-dimensional grid and one-dimensional blocks:
    dim3 nblocks(2, 2);
    int threads_per_block = 8;
    kernel<<<nblocks, threads_per_block>>>(...);

30

How are CUDA threads organized?

• The number of blocks and threads per block is exposed through intrinsic thread coordinate variables:
  – Dimensions
  – IDs

Variable                                  Meaning
gridDim.x, gridDim.y, gridDim.z           Number of blocks in a kernel launch.
blockIdx.x, blockIdx.y, blockIdx.z        Unique ID of the block that contains the current thread.
blockDim.x, blockDim.y, blockDim.z        Number of threads in each block.
threadIdx.x, threadIdx.y, threadIdx.z     Unique ID of the current thread within its block.
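To make the table concrete, here is a minimal sketch (not from the slides) of composing these variables into global 2-D coordinates; kernel2d, width, height, and d_arr are hypothetical names:

__global__ void kernel2d(int *arr, int width, int height) {
  int gx = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
  int gy = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
  if (gx < width && gy < height)
    arr[gy * width + gx] += 1;  // each thread updates one element
}

// e.g., kernel2d<<<dim3(2, 2), dim3(4, 2)>>>(d_arr, 8, 4);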

31

How are CUDA threads organized?

To calculate a globally unique ID for a thread inside a one-dimensional grid and one-dimensional block:

kernel<<<4, 8>>>(...);

__global__ void kernel(...) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  ...
}

32

Block 0   Block 1   Block 2   Block 3

(Figure: for the highlighted thread, blockIdx.x = 2, blockDim.x = 8, threadIdx.x = 2, so tid = 2 * 8 + 2 = 18.)

How are CUDA threads organized?

• Thread coordinates offer a way to differentiate threads and identify thread-specific input data or code paths.
  – Correlate data and computation, a mapping

__global__ void kernel(int *arr) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < 32) {
    arr[tid] = f(arr[tid]);   // code path for threads with tid < 32
  } else {
    arr[tid] = g(arr[tid]);   // code path for threads with tid >= 32
  }
}

33

Thread divergence: the useless code path is executed but then disabled in the SIMT execution model (execute-then-commit; more later).

How is GPU memory managed?

• CUDA Memory Management API
  – Allocation of GPU memory
  – Transfer of data from the host to GPU memory
  – Freeing GPU memory
  – Foo(int A[][N]) {}

Host Function    CUDA Analogue
malloc           cudaMalloc
memcpy           cudaMemcpy
free             cudaFree

34

How is GPU memory managed?

cudaError_t cudaMalloc(void **devPtr, size_t size);
  – Allocates size bytes of GPU memory and stores their address at *devPtr

cudaError_t cudaFree(void *devPtr);
  – Releases the device memory allocation stored at devPtr
  – Must be an allocation that was created using cudaMalloc
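Both calls return a cudaError_t, so a common defensive pattern is to check it. A minimal sketch; CUDA_CHECK is a hypothetical helper macro, not part of the CUDA API:

#include <stdio.h>
#include <stdlib.h>

// Abort with the CUDA error string if a runtime call fails.
#define CUDA_CHECK(call)                                        \
  do {                                                          \
    cudaError_t err = (call);                                   \
    if (err != cudaSuccess) {                                   \
      fprintf(stderr, "CUDA error %s at %s:%d\n",               \
              cudaGetErrorString(err), __FILE__, __LINE__);     \
      exit(EXIT_FAILURE);                                       \
    }                                                           \
  } while (0)

// Usage:
//   double *d_x;
//   CUDA_CHECK(cudaMalloc((void **)&d_x, size));
//   CUDA_CHECK(cudaFree(d_x));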

35

How is GPU memory managed?

cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);
  – Transfers count bytes from the memory pointed to by src to dst
  – kind can be:
    • cudaMemcpyHostToHost
    • cudaMemcpyHostToDevice
    • cudaMemcpyDeviceToHost
    • cudaMemcpyDeviceToDevice
  – The locations of dst and src must match kind, e.g., if kind is cudaMemcpyHostToDevice then src must be a host array and dst must be a device array

36

How is GPU memory managed?

void *d_arr, *h_arr;
h_arr = … ; /* init host memory and data */

// Allocate memory on GPU; its address is in d_arr
cudaMalloc((void **)&d_arr, nbytes);

// Transfer data from host to device
cudaMemcpy(d_arr, h_arr, nbytes, cudaMemcpyHostToDevice);

// Transfer data from device to host
cudaMemcpy(h_arr, d_arr, nbytes, cudaMemcpyDeviceToHost);

// Free the allocated memory
cudaFree(d_arr);

37

CUDA Program Flow

• At its most basic, the flow of a CUDA program is as follows:
  1. Allocate GPU memory
  2. Populate GPU memory with inputs from the host
  3. Execute a GPU kernel on those inputs
  4. Transfer outputs from the GPU back to the host
  5. Free GPU memory

38

Offloading Computation

(The complete DAXPY program from slide 24, shown again: CUDA kernel, serial host code, and the parallel kernel launch on the GPU.)

39

GPU Multi-Threading (SIMD)

• NVIDIA calls it Single-Instruction, Multiple-Thread (SIMT)
  – Many threads execute the same instructions in lock-step
    • A warp (32 threads)
    • Each thread ≈ vector lane; 32 lanes in lockstep
  – Implicit synchronization after every instruction (think vector parallelism)

(Figure: SIMT execution)
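As a small sketch (not from the slides) of how a thread can locate itself within this lock-step organization, using the built-in warpSize variable in a 1-D block; whoami, warp_of, and lane_of are hypothetical names, with the output arrays sized to the total thread count:

__global__ void whoami(int *warp_of, int *lane_of) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  warp_of[tid] = threadIdx.x / warpSize;  // which warp of the block this thread is in
  lane_of[tid] = threadIdx.x % warpSize;  // lane within the warp (0..31)
  // Threads with the same warp_of value execute in lock-step.
}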

40

GPU Multi-Threading

• In SIMT, all threads share instructions but operate on their own private registers, allowing threads to store thread-local state

(Figure: SIMT execution with private registers)

41

GPU Multi-Threading

• GPUs execute many groups of SIMT threads in parallel
  – Each executes instructions independent of the others

(Figure: SIMT Group (Warp) 0 and SIMT Group (Warp) 1 executing in parallel)

42

Warp Switching

• SMs can support more concurrent SIMT groups than their core count would suggest → coarse-grained multi-warping (the term I coined)
  – Similar to coarse-grained multi-threading
• Each thread persistently stores its own state in a private register set
  – Enables very efficient context switching between warps
• SIMT warps block if not actively computing
  – Swapped out for others; no worrying about losing state
• Keeping blocked SIMT groups scheduled on an SM would waste cores

43

Execution Model to Hardware

• This leads to a nested thread hierarchy on GPUs:
  1. SIMT Group (warp)
  2. SIMT Groups that concurrently run on the same SM
  3. SIMT Groups that execute together on the same GPU

44

NVIDIA PTX (Parallel Thread Execution) ISA

• Compiler target (not a hardware ISA)
  – Similar to the x86 ISA in spirit, and uses virtual registers
  – Both translate to an internal form (micro-ops in x86)
    • x86's translation happens in hardware at runtime
    • NVIDIA GPU PTX is translated by software at load time
• Basic format (d is the destination; a, b, and c are operands):
  opcode.type d, a, b, c;

45

Basic PTX Operations (ALU, MEM, and Control)

46

NVIDIA PTX GPU ISA Example

DAXPY

shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

47

__global__
void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

Conditional Branching in GPU

• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each core
    • I.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – … and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer

48

Conditional Branching in GPU

if (a > b) {
  max = a;
} else {
  max = b;
}

(Figure: two threads, one with a = 4, b = 3 and one with a = 3, b = 4; each is disabled on the path it does not take.)

• Instruction lock-step execution by multiple threads
• SIMT threads can be "disabled" when they need to execute instructions different from others in their group
  – Mask and commit
• Branch divergence
  – Hurts performance and efficiency

49

PTX Example

if (X[i] != 0)
  X[i] = X[i] - Y[i];
else
  X[i] = Z[i];

ld.global.f64  RD0, [X+R8]      ; RD0 = X[i]
setp.neq.s32   P1, RD0, #0      ; P1 is predicate register 1
@!P1, bra ELSE1, *Push          ; Push old mask, set new mask bits
                                ; if P1 false, go to ELSE1
ld.global.f64  RD2, [Y+R8]      ; RD2 = Y[i]
sub.f64        RD0, RD0, RD2    ; Difference in RD0
st.global.f64  [X+R8], RD0      ; X[i] = RD0
@P1, bra ENDIF1, *Comp          ; complement mask bits
                                ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8]  ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0  ; X[i] = RD0
ENDIF1: <next instruction>, *Pop   ; pop to restore old mask

50

NVIDIA GPU Memory Structures

• Each core has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilled registers, and private variables
• Each SM processor also has local memory
  – Shared by cores/threads within an SM/block
• Memory shared by SM processors is GPU Memory
  – Host can read and write GPU memory

(Figure: an SM with 16 SP cores and its shared memory, above global memory on the device.)

51

GPU Memory for CUDA Programming

52

(Figure callouts: local variables, etc. (registers/local memory); explicitly managed using __shared__ (shared memory); allocated with cudaMalloc (global memory).)

Shared Memory Allocation

• Shared memory can be allocated statically or dynamically
• Statically allocated shared memory
  – Size is fixed at compile time
  – Can declare many statically allocated shared memory variables
  – Can be declared globally or inside a device function
  – Can be multi-dimensional arrays

__shared__ int s_arr[256][256];
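As a small usage sketch (not from the slides): the typical pattern is to stage data in shared memory, barrier with __syncthreads(), then read it back. The kernel below (hypothetical name reverse_block) reverses the elements within each block, assuming blocks of exactly 256 threads:

__global__ void reverse_block(int *d_arr) {
  __shared__ int s[256];                    // statically allocated shared memory
  int t = threadIdx.x;
  int base = blockIdx.x * blockDim.x;
  s[t] = d_arr[base + t];                   // stage one element per thread
  __syncthreads();                          // whole block must finish writing
  d_arr[base + t] = s[blockDim.x - 1 - t];  // read back in reversed order
}

// Launch with 256 threads per block to match the shared array:
//   reverse_block<<<nblocks, 256>>>(d_arr);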

53

Shared Memory Allocation

• Dynamically allocated shared memory
  – Size in bytes is set at kernel launch with a third kernel launch configuration parameter
  – Can only have one dynamically allocated shared memory array per kernel
  – Must be a one-dimensional array

__global__ void kernel(...) {
  extern __shared__ int s_arr[];
  ...
}

kernel<<<nblocks, threads_per_block, shared_memory_bytes>>>(...);

54

GPU Memory

• More complicated
• Different usage scope
• Different size and performance
  – Latency and bandwidth
  – Read-only or R/W cache

55

(Figure: memory hierarchy. A SIMT thread group has registers and local memory; SIMT thread groups on an SM share on-chip shared memory; SIMT thread groups on a GPU share global, constant, and texture memory.)

GPU and Manycore Architecture

We only INTRODUCE the programming interface and architecture.

For more info:
  – http://docs.nvidia.com/cuda/
  – Professional CUDA C Programming, John Cheng, Max Grossman, Ty McKercher, John Wiley & Sons, September 8, 2014

Other related info:
  – AMD GPU and OpenCL
  – Programming with accelerators using pragmas
    • OpenMP and OpenACC

56

Loop-Level Parallelism

• Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
  – Loop-carried dependence
• Example 1:
  for (i = 999; i >= 0; i = i - 1)
    x[i] = x[i] + s;
• No loop-carried dependence

57

Loop-Level Parallelism

• Example 2:
  for (i = 0; i < 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
  }
• S1 and S2 use values computed by S1 and S2 in the previous iteration: loop-carried dependence → serial execution
  – A[i] → A[i+1], B[i] → B[i+1]
• S2 uses a value computed by S1 in the same iteration → not loop-carried
  – A[i+1] → A[i+1]

58

Loop-Level Parallelism

• Example 3:
  for (i = 0; i < 100; i = i + 1) {
    A[i] = A[i] + B[i];    /* S1 */
    B[i+1] = C[i] + D[i];  /* S2 */
  }
• S1 uses a value computed by S2 in the previous iteration, but the dependence is not circular, so the loop is parallel
• Transform to:
  A[0] = A[0] + B[0];
  for (i = 0; i < 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
  }
  B[100] = C[99] + D[99];

59

Loop-Level Parallelism

• Example 4:
  for (i = 0; i < 100; i = i + 1) {
    A[i] = B[i] + C[i];  /* S1 */
    D[i] = A[i] * E[i];  /* S2 */
  }
  No need to store A[i] in S1 and then load A[i] in S2

• Example 5:
  for (i = 1; i < 100; i = i + 1) {
    Y[i] = Y[i-1] + Y[i];
  }
  Recurrence: of interest for exploiting pipelining parallelism between iterations

60

Finding dependencies

• Assume indices are affine:
  – a × i + b (i is the loop index; a and b are constants)
• Assume:
  – Store to a × i + b, then
  – Load from c × i + d
  – i runs from m to n
  – A dependence exists if:
    • Given j, k such that m ≤ j ≤ n, m ≤ k ≤ n
    • Store to a × j + b, load from c × k + d, and a × j + b = c × k + d

61

Finding dependencies

• Generally cannot be determined at compile time
• Test for absence of a dependence:
  – GCD test:
    • If a dependence exists, GCD(c, a) must evenly divide (d − b)
• Example:
  for (i = 0; i < 100; i = i + 1) {
    X[2*i+3] = X[2*i] * 5.0;
  }
  a = 2, b = 3, c = 2, and d = 0; then GCD(a, c) = 2 and d − b = −3. Since 2 does not divide −3, no dependence is possible.
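The test is easy to mechanize. A minimal sketch in C (gcd and dependence_possible are hypothetical helpers, not from the slides):

#include <stdio.h>

// Greatest common divisor via Euclid's algorithm.
static int gcd(int a, int b) {
  while (b != 0) { int t = a % b; a = b; b = t; }
  return a < 0 ? -a : a;
}

// GCD test: a dependence between a store to a*i+b and a load from
// c*i+d is only possible if GCD(a, c) evenly divides (d - b).
static int dependence_possible(int a, int b, int c, int d) {
  return (d - b) % gcd(a, c) == 0;
}

int main(void) {
  // X[2*i+3] = X[2*i] * 5.0  =>  a = 2, b = 3, c = 2, d = 0
  printf("%s\n", dependence_possible(2, 3, 2, 0)
                     ? "dependence possible" : "no dependence");
  return 0;
}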

62

Finding dependencies

• Example 2:
  for (i = 0; i < 100; i = i + 1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
  }
• True dependences:
  – S1 to S3 and S1 to S4 because of Y[i]; not loop-carried
• Antidependence:
  – S1 to S2 based on X[i], and S3 to S4 for Y[i]
• Output dependence:
  – S1 to S4 based on Y[i]
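The anti- and output dependences are false dependences and can be removed by renaming; a sketch, where T and X1 are fresh arrays introduced for illustration (uses of X after the loop must then refer to X1):

for (i = 0; i < 100; i = i + 1) {
  T[i] = X[i] / c;   /* Y renamed to T: removes the output dependence */
  X1[i] = X[i] + c;  /* X renamed to X1: removes the antidependence   */
  Z[i] = T[i] + c;
  Y[i] = c - T[i];   /* reads T, so the S3-to-S4 antidependence is gone */
}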

63

Reductions

• Reduction operation:
  for (i = 9999; i >= 0; i = i - 1)
    sum = sum + x[i] * y[i];
• Transform to…
  for (i = 9999; i >= 0; i = i - 1)
    sum[i] = x[i] * y[i];
  for (i = 9999; i >= 0; i = i - 1)
    finalsum = finalsum + sum[i];
• Do on p processors:
  for (i = 999; i >= 0; i = i - 1)
    finalsum[p] = finalsum[p] + sum[i + 1000*p];
• Note: assumes associativity!
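On a GPU the same idea is usually implemented as a tree reduction in shared memory. A minimal sketch (not from the slides); block_sum, d_in, and d_out are hypothetical names, and blockDim.x is assumed to be a power of two:

__global__ void block_sum(const double *in, double *out, int n) {
  extern __shared__ double s[];            // dynamically allocated shared memory
  int tid = threadIdx.x;
  int i = blockIdx.x * blockDim.x + tid;
  s[tid] = (i < n) ? in[i] : 0.0;          // load one element per thread
  __syncthreads();
  // Tree reduction: halve the number of active threads each step.
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride) s[tid] += s[tid + stride];
    __syncthreads();
  }
  if (tid == 0) out[blockIdx.x] = s[0];    // one partial sum per block
}

// Launch: block_sum<<<nblocks, 256, 256 * sizeof(double)>>>(d_in, d_out, n);
// The per-block partials in d_out are then summed (e.g., on the host).
// Like the transformation above, this reorders additions and assumes associativity.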

64

Dependency Analysis

• Mostly done by the compiler before vectorization
  – Can be conservative if the compiler is not 100% sure
• For the programmer:
  – Write code that can be easily analyzed by the compiler for vectorization
  – Use an explicit parallel model such as OpenMP or CUDA

65

https://computing.llnl.gov/tutorials/openMP/

Wrap-Ups (Vector, SIMD and GPU)

• Data-level parallelism

66