Lecture 22: Data Level Parallelism -- Graphical Processing Unit (GPU) and Loop-Level Parallelism
CSCE 513 Computer Architecture
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
https://passlab.github.io/CSCE513
Topics for Data Level Parallelism (DLP)
• Parallelism (centered around …)
  – Instruction Level Parallelism
  – Data Level Parallelism
  – Thread Level Parallelism
• DLP Introduction and Vector Architecture – 4.1, 4.2
• SIMD Instruction Set Extensions for Multimedia – 4.3
• Graphical Processing Units (GPU) – 4.4
• GPU and Loop-Level Parallelism and Others – 4.4, 4.5
Computer Graphics
GPU: Graphics Processing Unit
Graphics Processing Unit (GPU)
Image: http://www.ntu.edu.sg/home/ehchua/programming/opengl/CG_BasicsTheory.html
Recent GPU Architecture
• Unified Scalar Shader Architecture
• Highly Data-Parallel Stream Processing
An Introduction to Modern GPU Architecture, Ashu Rege, NVIDIA Director of Developer Technology
ftp://download.nvidia.com/developer/cuda/seminar/TDCI_Arch.pdf
Unified Shader Architecture
FIGURE A.2.5 Basic unified GPU architecture. Example GPU with 112 streaming processor (SP) cores organized in 14 streaming multiprocessors (SMs); the cores are highly multithreaded. It has the basic Tesla architecture of an NVIDIA GeForce 8800. The processors connect with four 64-bit-wide DRAM partitions via an interconnection network. Each SM has eight SP cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit, and a shared memory. Copyright © 2009 Elsevier, Inc. All rights reserved.
GPU Today
• It is a processor optimized for 2D/3D graphics, video, visual computing, and display.
• It is a highly parallel, highly multithreaded multiprocessor optimized for visual computing.
• It provides real-time visual interaction with computed objects via graphics images and video.
• It serves as both a programmable graphics processor and a scalable parallel computing platform.
  – Heterogeneous systems: combine a GPU with a CPU
• It is called a many-core processor.
Latest NVIDIA Volta GV100 GPU
• Released May 2017
  – Total 84 Streaming Multiprocessors (SM)
• Cores
  – 5120 FP32 cores
    • Can do FP16 also
  – 2560 FP64 cores
  – 640 Tensor cores
• Memory
  – 16 GB HBM2
  – L2: 6144 KB
  – Shared memory: 96 KB * 80 (SM)
  – Register file: 20,480 KB (huge)
https://devblogs.nvidia.com/parallelforall/inside-volta/
SM of Volta GPU
GPU Performance Gains Over CPU
http://docs.nvidia.com/cuda/cuda-c-programming-guide
Programming for NVIDIA GPUs
http://docs.nvidia.com/cuda/cuda-c-programming-guide/
CUDA (Compute Unified Device Architecture)
Both an architecture and a programming model
• Architecture and execution model
  – Introduced by NVIDIA in 2007
  – Getting the highest possible execution performance requires understanding of the hardware architecture
• Programming model
  – Small set of extensions to C
  – Enables GPUs to execute programs written in C
  – Within C programs, call SIMT "kernel" routines that are executed on the GPU.
CUDA Thread
• Parallelism in Vector/SIMD is the combination of lanes (# of PUs) and vector length
• A CUDA thread is a unified term that abstracts the parallelism for both programmers and the GPU execution model
  – Programmer: a CUDA thread performs operations for one data element (think of it this way for now)
    • There could be thousands or millions of threads
  – A CUDA thread represents a hardware FU
    • GPU calls it a core (much simpler than a conventional CPU core)
• Hardware-level parallelism is more explicit
CUDA Thread Hierarchy
• Allows flexibility and efficiency in processing 1-D, 2-D, and 3-D data on the GPU.
• Linked to the internal organization
• Threads in one block execute together.
Grids and blocks can be 1, 2 or 3 dimensions.
DAXPY
// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

Create a number of threads equal to (or slightly greater than) the number of elements to be processed; each thread runs the same daxpy function.
Each thread finds its element to compute and does the work.
DAXPY with Device Code
__global__ void daxpy( … )
• The CUDA C/C++ keyword __global__ indicates a function that:
  – Runs on the device
  – Is called from host code
• The nvcc compiler separates source code into host and device components
  – Device functions (e.g. daxpy()) are processed by the NVIDIA compiler
  – Host functions (e.g. main()) are processed by a standard host compiler
    • gcc, cl.exe
DAXPY with Device Code
daxpy<<<num_blocks, num_threads>>>();
• Triple angle brackets mark a call from host code to device code
  – Also called a "kernel launch"
  – The <<<...>>> parameters specify the thread dimensionality
• That's all that is required to execute a function on the GPU!
GPU Computing – Offloading Computation
• The GPU is connected to the CPU by a reasonably fast bus (8 GB/s is typical today): PCIe
• Terminology
  – Host: the CPU and its memory (host memory)
  – Device: the GPU and its memory (device memory)
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory (over the PCI bus)
2. Load the GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
Offloading Computation
// DAXPY in CUDA
__global__                                        // CUDA kernel
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

int main(void) {                                  // serial code
    int n = 1024;
    double *x, *y;     /* host copies of x and y */
    double *d_x, *d_y; /* device copies of x and y */
    int size = n * sizeof(double);

    // Alloc space for host copies and set up values
    x = (double *)malloc(size); fill_doubles(x, n);
    y = (double *)malloc(size); fill_doubles(y, n);

    // Alloc space for device copies
    cudaMalloc((void **)&d_x, size);
    cudaMalloc((void **)&d_y, size);

    // Copy to device
    cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, size, cudaMemcpyHostToDevice);

    // Invoke DAXPY with 256 threads per Block    // parallel exe on GPU
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);

    // Copy result back to host                   // serial code
    cudaMemcpy(y, d_y, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(x); free(y);
    cudaFree(d_x); cudaFree(d_y);
    return 0;
}
CUDA Programming Model for NVIDIA GPUs
• The CUDA API is split into:
  – The CUDA Management API
  – The CUDA Kernel API
• The CUDA Management API is for a variety of operations
  – GPU memory allocation, data transfer, execution, resource creation
  – Mostly regular C functions and calls
• The CUDA Kernel API is used to define the computation to be performed by the GPU
  – C extensions
CUDA Kernel, i.e. Thread Functions
• A CUDA kernel:
  – Defines the operations to be performed by a single thread on the GPU
  – Just as a C/C++ function defines work to be done on the CPU
  – Syntactically, a kernel looks like C/C++ with some extensions
__global__ void kernel(...) {
    ...
}
  – Every CUDA thread executes the same kernel logic (SIMT)
  – Initially, the only difference between threads is their thread coordinates
Programming View: How are CUDA threads organized?
• CUDA thread hierarchy
  – Thread Block = SIMT groups that run concurrently on an SM
    • Can barrier-sync and have shared access to the SM's shared memory
  – Grid = all Thread Blocks created by the same kernel launch
    • Shared access to GPU global memory
• Launching a kernel is simple and similar to a function call:
  – kernel name and arguments
  – # of thread blocks/grid and # of threads/block to create:
kernel<<<nblocks, threads_per_block>>>(arg1, arg2, ...);
How are CUDA threads organized?
• Threads can be configured in one-, two-, or three-dimensional layouts
  – One-dimensional blocks and grids:
int nblocks = 4;
int threads_per_block = 8;
kernel<<<nblocks, threads_per_block>>>(...);

Block 0  Block 1  Block 2  Block 3
How are CUDA threads organized?
• Threads can be configured in one-, two-, or three-dimensional layouts
  – Two-dimensional blocks and grids:
dim3 nblocks(2, 2);
dim3 threads_per_block(4, 2);
kernel<<<nblocks, threads_per_block>>>(...);
How are CUDA threads organized?
• Threads can be configured in one-, two-, or three-dimensional layouts
  – Two-dimensional grid and one-dimensional blocks:
dim3 nblocks(2, 2);
int threads_per_block = 8;
kernel<<<nblocks, threads_per_block>>>(...);
How are CUDA threads organized?
• The number of blocks and threads per block is exposed through intrinsic thread coordinate variables:
  – Dimensions
  – IDs

Variable                               | Meaning
gridDim.x, gridDim.y, gridDim.z        | Number of blocks in a kernel launch.
blockIdx.x, blockIdx.y, blockIdx.z     | Unique ID of the block that contains the current thread.
blockDim.x, blockDim.y, blockDim.z     | Number of threads in each block.
threadIdx.x, threadIdx.y, threadIdx.z  | Unique ID of the current thread within its block.
How are CUDA threads organized?
To calculate a globally unique ID for a thread inside a one-dimensional grid with one-dimensional blocks:
kernel<<<4, 8>>>(...);

__global__ void kernel(...) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    ...
}

Block 0  Block 1  Block 2  Block 3
For the third thread of the third block: blockIdx.x = 2; blockDim.x = 8; threadIdx.x = 2; so tid = 2 * 8 + 2 = 18.
How are CUDA threads organized?
• Thread coordinates offer a way to differentiate threads and identify thread-specific input data or code paths.
  – Correlate data and computation: a mapping
__global__ void kernel(int *arr) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < 32) {
        arr[tid] = f(arr[tid]);   // code path for threads with tid < 32
    } else {
        arr[tid] = g(arr[tid]);   // code path for threads with tid >= 32
    }
}
Thread divergence: the useless code path is also executed, but its effects are disabled in the SIMT execution model (execute, then commit; more later).
How is GPU memory managed?
• CUDA Memory Management API
  – Allocation of GPU memory
  – Transfer of data from the host to GPU memory
  – Freeing GPU memory

Host Function | CUDA Analogue
malloc        | cudaMalloc
memcpy        | cudaMemcpy
free          | cudaFree
How is GPU memory managed?
cudaError_t cudaMalloc(void **devPtr, size_t size);
  – Allocates size bytes of GPU memory and stores their address at *devPtr
cudaError_t cudaFree(void *devPtr);
  – Releases the device memory allocation stored at devPtr
  – Must be an allocation that was created using cudaMalloc
How is GPU memory managed?
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);
  – Transfers count bytes from the memory pointed to by src to dst
  – kind can be:
    • cudaMemcpyHostToHost,
    • cudaMemcpyHostToDevice,
    • cudaMemcpyDeviceToHost,
    • cudaMemcpyDeviceToDevice
  – The locations of dst and src must match kind, e.g. if kind is cudaMemcpyHostToDevice then src must be a host array and dst must be a device array
HowisGPUmemorymanaged?
void *d_arr, *h_arr;h_addr = … ; /* init host memory and data */// Allocate memory on GPU and its address is in d_arrcudaMalloc((void **)&d_arr, nbytes);
// Transfer data from host to devicecudaMemcpy(d_arr, h_arr, nbytes,
cudaMemcpyHostToDevice);
// Transfer data from a device to a hostcudaMemcpy(h_arr, d_arr, nbytes,
cudaMemcpyDeviceToHost);
// Free the allocated memorycudaFree(d_arr);
37
CUDA Program Flow
• At its most basic, the flow of a CUDA program is as follows:
  1. Allocate GPU memory
  2. Populate GPU memory with inputs from the host
  3. Execute a GPU kernel on those inputs
  4. Transfer outputs from the GPU back to the host
  5. Free GPU memory
GPU Multi-Threading (SIMD)
• NVIDIA calls it Single-Instruction, Multiple-Thread (SIMT)
  – Many threads execute the same instructions in lock-step
    • A warp (32 threads)
    • Each thread ≈ a vector lane; 32 lanes in lockstep
  – Implicit synchronization after every instruction (think vector parallelism)
GPU Multi-Threading
• In SIMT, all threads share instructions but operate on their own private registers, allowing threads to store thread-local state
GPU Multi-Threading
• GPUs execute many groups of SIMT threads in parallel
  – Each executes instructions independent of the others
SIMT Group (Warp) 0
SIMT Group (Warp) 1
Warp Switching
• SMs can support more concurrent SIMT groups than their core count would suggest → coarse-grained multi-warping (the term I coined)
  – Similar to coarse-grained multi-threading
• Each thread persistently stores its own state in a private register set
  – Enables very efficient context switching between warps
• SIMT warps block if not actively computing
  – Swapped out for others, with no worry about losing state
• Keeping blocked SIMT groups scheduled on an SM would waste cores
Execution Model to Hardware
• This leads to a nested thread hierarchy on GPUs:
  – A SIMT group
  – SIMT groups that concurrently run on the same SM
  – SIMT groups that execute together on the same GPU
NVIDIA PTX (Parallel Thread Execution) ISA
• Compiler target (not a hardware ISA)
  – Similar to the x86 ISA, and uses virtual registers
  – Both translate to an internal form (micro-ops in x86)
    • x86's translation happens in hardware at runtime
    • NVIDIA GPU PTX is translated by software at load time
• Basic format (d is the destination; a, b and c are operands):
opcode.type d, a, b, c;
Basic PTX Operations (ALU, MEM, and Control)
NVIDIA PTX GPU ISA Example: DAXPY
shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

__global__
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
Conditional Branching in GPU
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses
  – Branch synchronization stack
    • Entries consist of masks for each core
    • I.e. which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – … and when paths converge
    • Act as barriers
    • Pop the stack
• Per-thread-lane 1-bit predicate register, specified by the programmer
Conditional Branching in GPU
if (a > b) {
    max = a;   // threads with a = 3, b = 4 are disabled here
} else {
    max = b;   // threads with a = 4, b = 3 are disabled here
}
• Instructions execute in lock-step across the multiple threads
• SIMT threads can be "disabled" when they need to execute instructions different from others in their group
  – Mask and commit
• Branch divergence
  – Hurts performance and efficiency
PTX Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

        ld.global.f64  RD0, [X+R8]    ; RD0 = X[i]
        setp.neq.s32   P1, RD0, #0    ; P1 is predicate register 1
        @!P1, bra ELSE1, *Push        ; Push old mask, set new mask bits
                                      ; if P1 false, go to ELSE1
        ld.global.f64  RD2, [Y+R8]    ; RD2 = Y[i]
        sub.f64        RD0, RD0, RD2  ; Difference in RD0
        st.global.f64  [X+R8], RD0    ; X[i] = RD0
        @P1, bra ENDIF1, *Comp        ; complement mask bits
                                      ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64  RD0, [Z+R8]    ; RD0 = Z[i]
        st.global.f64  [X+R8], RD0    ; X[i] = RD0
ENDIF1: <next instruction>, *Pop      ; pop to restore old mask
NVIDIA GPU Memory Structures
• Each core has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilled registers, and private variables
• Each SM processor also has local memory
  – Shared by cores/threads within an SM/block
• Memory shared by the SM processors is GPU memory
  – The host can read and write GPU memory
[Figure: an SM with its SP cores and on-chip shared memory, backed by global memory on the device]
GPU Memory for CUDA Programming
[Figure: the CUDA memory spaces — registers for local variables etc., shared memory explicitly managed using __shared__, and global memory allocated with cudaMalloc]
Shared Memory Allocation
• Shared memory can be allocated statically or dynamically
• Statically allocated shared memory
  – Size is fixed at compile-time
  – Can declare many statically allocated shared memory variables
  – Can be declared globally or inside a device function
  – Can be multi-dimensional arrays
__shared__ int s_arr[256][256];
Shared Memory Allocation
• Dynamically allocated shared memory
  – Size in bytes is set at kernel launch with a third kernel launch configuration parameter
  – Can only have one dynamically allocated shared memory array per kernel
  – Must be a one-dimensional array
__global__ void kernel(...) {
    extern __shared__ int s_arr[];
    ...
}

kernel<<<nblocks, threads_per_block, shared_memory_bytes>>>(...);
GPU Memory
• More complicated
• Different usage scopes
• Different sizes and performance
  – Latency and bandwidth
  – Read-only or R/W cache
Scopes, from narrowest to widest:
  – A SIMT thread group: registers and local memory
  – SIMT thread groups on an SM: on-chip shared memory
  – SIMT thread groups on a GPU: global memory, constant memory, texture memory
GPU and Manycore Architecture
We only INTRODUCE the programming interface and architecture here.
For more info:
  – http://docs.nvidia.com/cuda/
  – Professional CUDA C Programming, John Cheng, Max Grossman, Ty McKercher, September 8, 2014, John Wiley & Sons
Other related info:
  – AMD GPU and OpenCL
  – Programming with accelerators using pragmas
    • OpenMP and OpenACC
Loop-Level Parallelism
• Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
  – Loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop-carried dependence
Loop-Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 use values computed by S1 and S2 in the previous iteration: loop-carried dependence → serial execution
  – A[i] → A[i+1], B[i] → B[i+1]
• S2 uses a value computed by S1 in the same iteration → not loop-carried
  – A[i+1] → A[i+1]
Loop-Level Parallelism
• Example 3:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];    /* S1 */
    B[i+1] = C[i] + D[i];  /* S2 */
}
• S1 uses a value computed by S2 in the previous iteration, but the dependence is not circular, so the loop is parallel
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[100] = C[99] + D[99];
Loop-Level Parallelism
• Example 4:
for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];  /* S1 */
    D[i] = A[i] * E[i];  /* S2 */
}
No need to store A[i] in S1 and then load A[i] in S2.
• Example 5:
for (i=1; i<100; i=i+1) {
    Y[i] = Y[i-1] + Y[i];
}
Recurrence: relevant for exploiting pipelining parallelism between iterations.
Finding Dependencies
• Assume indices are affine:
  – a × i + b (i is the loop index; a and b are constants)
• Assume:
  – Store to a × i + b, then
  – Load from c × i + d
  – i runs from m to n
  – A dependence exists if:
    • There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
    • The store to a × j + b and the load from c × k + d touch the same element: a × j + b = c × k + d
Finding Dependencies
• Generally cannot be determined at compile time
• Test for absence of a dependence:
  – GCD test:
    • If a dependence exists, GCD(c, a) must evenly divide (d - b)
• Example:
for (i=0; i<100; i=i+1) {
    X[2*i+3] = X[2*i] * 5.0;
}
a = 2, b = 3, c = 2, and d = 0; then GCD(a, c) = 2, and d − b = −3. Since 2 does not divide −3, no dependence is possible.
Finding Dependencies
• Example 2:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
}
• True dependences:
  – S1 to S3 and S1 to S4 because of Y[i]; not loop-carried
• Antidependences:
  – S1 to S2 based on X[i], and S3 to S4 for Y[i]
• Output dependence:
  – S1 to S4 based on Y[i]
Reductions
• Reduction operation:
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• Transform to:
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
• Do the second loop on p processors (each processor sums its own 1000-element strip):
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i + 1000*p];
• Note: this assumes associativity!
Dependency Analysis
• Mostly done by the compiler before vectorization
  – Can be conservative if the compiler is not 100% sure
• For the programmer:
  – Write code that can be easily analyzed by the compiler for vectorization
  – Use an explicit parallel model such as OpenMP or CUDA
https://computing.llnl.gov/tutorials/openMP/
Wrap-Up (Vector, SIMD and GPU)
• Data-level parallelism