
Page 1

Programming Heterogeneous Parallel Architectures

Directive Programming with OpenACC

Michael Wolfe [email protected]
http://www.pgroup.com

CEA-INRIA-EDF Ecoles d'Eté 2013

Page 2

Outline

• PGI
• Introduction
  • CPU Architecture
  • Parallel Programming
  • Accelerator Architectures
  • Accelerator Programming
• Beginning OpenACC
• Performance Tuning
• Advanced Topics

2

Page 3

PGI

• C, C++, Fortran compilers
  • Optimizing
  • Vectorizing
  • Parallelizing
• Graphical parallel tools
  • PGDBG debugger
  • PGPROF performance profiler
• AMD, Intel, NVIDIA processors
  • 64-bit / 32-bit
  • PGI Unified Binary™ technology
  • Linux, MacOS, Windows
  • Visual Studio integration
  • NVIDIA GPU support
• CUDA Fortran
• OpenACC™

www.pgroup.com

Page 4

Outline

• PGI
• Introduction
  • CPU Architecture
  • Parallel Programming
  • Accelerator Architectures
  • Accelerator Programming
• Beginning OpenACC
• Intermediate OpenACC
• Advanced Topics

Page 5

[CPU core block diagram: integer and FP register files; integer (+, x, &), FP (FP+, FPx), and div/sqrt functional units; branch unit; control unit; load/store unit; Level 1 and Level 2 caches]

Page 6

AMD “Magny-Cours”

Page 7

Host Architecture Features

• Register files (integer, float)
• Functional units (integer, float, address), branch, load/store
• Cache: data, instruction
• Execution pipeline (fetch, decode, issue, execute, cache, commit)

1-7

Page 8

Making It Faster

• Faster clocks
  • silicon technology
  • pipelining
• More work per clock
  • pipelining
  • multiscalar execution: superscalar control unit, VLIW
  • vector / SIMD instructions
  • multiple cores (MESI cache protocols)
• Fewer stalls
  • reservation stations, out-of-order execution
  • branch prediction, speculative execution
  • multithreading
  • cache memory, cache coherence

Page 9

AMD “Magny-Cours”

9

Page 10

Parallel Programming on CPUs

• Hide / Expose and Virtualize / Explicit
• Instruction level parallelism (ILP)
  • loop unrolling, instruction scheduling
• Vector parallelism
  • vectorized loops (or vector intrinsics)
• Thread level / multiprocessor / multicore parallelism
  • parallel loops, parallel tasks
  • Posix threads, OpenMP, Cilk, TBB, ...
• Large scale cluster / multicomputer parallelism
  • MPI (& HPF, co-array Fortran, UPC, Titanium, X10, Fortress, Chapel)

Page 11

pthreads parallel vector add

for( i = 1; i < ncores; ++i )
    pthread_create( &th[i], NULL, vadd, (void*)(long)i );
vadd( (void*)0 );
for( i = 1; i < ncores; ++i )
    pthread_join( th[i], NULL );
...
void* vadd( void* arg ){
    int threadnum = (int)(long)arg;   /* which slice of the iteration space */
    int work = (n+numthreads-1)/numthreads;
    int ifirst = threadnum*work;
    int ilast = min(n,(threadnum+1)*work);
    for( int i = ifirst; i < ilast; ++i )
        a[i] = b[i] + c[i];
    return NULL;
}

11

Page 12

OpenMP parallel vector add

#pragma omp parallel for private(i)
for( i = 0; i < n; ++i )
    a[i] = b[i] + c[i];

12

Page 13

OpenMP Behind the Scenes

• Compiler generates code for N threads
  • split up the iterations across N threads
• Assumptions
  • uniform memory access costs
  • coherent cache mechanism
• Virtualization penalties
  • load balancing
  • cache locality
  • vectorization within the threads
  • thread management
  • loop scheduling (which thread does what iteration)
  • NUMA memory access penalty
• Productivity, Portability, Performance

Page 14

Outline

• PGI
• Introduction
  • CPU Architecture
  • Parallel Programming
  • Accelerator Architectures
  • Accelerator Programming
• Beginning OpenACC
• Performance Tuning
• Advanced Topics

Page 15

NVIDIA Kepler Overall Block Diagram*

* From the whitepaper "NVIDIA's Next Generation CUDA™ Compute Architecture: Kepler™ GK110", © 2012 NVIDIA Corporation.

Page 16

NVIDIA Kepler SMX Block Diagram*

* From the whitepaper "NVIDIA's Next Generation CUDA™ Compute Architecture: Kepler™ GK110", © 2012 NVIDIA Corporation.

• 192 SP CUDA cores
• 64 DP units
• 32 SFUs
• 32 ld/st units

16

Page 17

Intel Xeon Phi Coprocessor Block Diagram*

* From: http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-coprocessor-block-diagram.html, © 2012 Intel Corporation

Page 18

Intel Xeon Phi Cores Block Diagram*

* From: “Intel Xeon processors and Intel Xeon Phi coprocessors”, Copyright © 2012, Intel Corporation.

Page 19

AMD Radeon 7970 Block Diagram*

*From “AMD Accelerated Parallel Processing – OpenCL Programming Guide”, © 2012 Advanced Micro Devices, Inc.

Page 20

AMD Radeon 7970 Compute Unit*

*From “AMD Accelerated Parallel Processing – OpenCL Programming Guide”, © 2012 Advanced Micro Devices, Inc.

Page 21

Abstract Accelerator Target

Page 22

Accelerator Architecture Features

• Separate memory (relatively small)
• High bandwidth memory interface
• Many degrees of parallelism
  • MIMD parallelism across many cores
  • SIMD parallelism within a core
  • multithreading for latency tolerance
• Asynchronous with host
• Performance from parallelism
  • slower clock
  • less ILP
  • simpler control unit
  • smaller caches

22

Page 23

Abstracted x64+Tesla-10 Architecture

23

Page 24

Abstracted x64+Fermi Architecture

24

Page 25

Abstracted x64+Kepler Architecture

25

Page 26

GPU Architecture Features

• Optimized for high degree of regular parallelism
• Classically optimized for low precision
• High bandwidth memory (some have no support for ECC)
• Highly multithreaded (slack parallelism)
• Hardware thread scheduling
• Non-coherent software-managed data caches
  • some hardware data cache
• No multiprocessor memory model guarantees on GPUs
  • some guarantees with fence operations within an SM/SMX

26

Page 27

OpenACC Abstract Machine Architecture

Page 28

Abstract Accelerator Target

Page 29

Parallel Programming on GPUs

• High degree of regular parallelism
  • lots of scalar threads
  • threads organized into thread blocks or workgroups (SIMD, pseudo-SIMD)
  • thread blocks organized into a grid (MIMD)
• Languages
  • CUDA, OpenCL; graphics: OpenGL, DirectX
  • C++ AMP
  • Thrust
  • may include vector datatypes (float4, double2)

29

Page 30

Accelerator Programming

• Allocate data on the GPU
• Move data from host, or initialize data on GPU
• Launch kernel(s)
  • GPU driver can generate ISA code at runtime
  • preserves forward compatibility without requiring ISA compatibility
• Gather results from GPU
• Deallocate data

30

Page 31

Host-side CUDA C Control Code

memsize = sizeof(float)*n;
cudaMalloc( &da, memsize );
cudaMalloc( &db, memsize );
cudaMalloc( &dc, memsize );
cudaMemcpy( db, b, memsize, cudaMemcpyHostToDevice );
cudaMemcpy( dc, c, memsize, cudaMemcpyHostToDevice );
dim3 threads( 256 );
dim3 blocks( n/256 );
vaddkernel<<<blocks,threads>>>( da, db, dc, n );
cudaMemcpy( a, da, memsize, cudaMemcpyDeviceToHost );
cudaFree( da ); cudaFree( db ); cudaFree( dc );

31

Page 32

Device-side CUDA C GPU Code

extern "C" __global__ void
vaddkernel( float* a, float* b, float* c, int n )
{
    int ti = threadIdx.x;                 /* local index  */
    int i = blockIdx.x*blockDim.x + ti;   /* global index */
    a[i] = b[i] + c[i];
}

32

Page 33

Host-side CUDA C Control Code

memsize = sizeof(float)*n;
cudaMalloc( &da, memsize );
cudaMalloc( &db, memsize );
cudaMalloc( &dc, memsize );
cudaMemcpy( db, b, memsize, cudaMemcpyHostToDevice );
cudaMemcpy( dc, c, memsize, cudaMemcpyHostToDevice );
dim3 threads( 256 );
dim3 blocks( (n+255)/256 );
vaddkernel<<<blocks,threads>>>( da, db, dc, n );
cudaMemcpy( a, da, memsize, cudaMemcpyDeviceToHost );
cudaFree( da ); cudaFree( db ); cudaFree( dc );

33

Page 34

Device-side CUDA C GPU Code

extern "C" __global__ void
vaddkernel( float* a, float* b, float* c, int n )
{
    int ti = threadIdx.x;                 /* local index  */
    int i = blockIdx.x*blockDim.x + ti;   /* global index */
    if( i < n ) a[i] = b[i] + c[i];
}

34

Page 35

Host-side CUDA C Control Code

memsize = sizeof(float)*n;
cudaMalloc( &da, memsize );
cudaMalloc( &db, memsize );
cudaMalloc( &dc, memsize );
cudaMemcpy( db, b, memsize, cudaMemcpyHostToDevice );
cudaMemcpy( dc, c, memsize, cudaMemcpyHostToDevice );
dim3 threads( 256 );
dim3 blocks( min((n+255)/256,65535) );
vaddkernel<<<blocks,threads>>>( da, db, dc, n );
cudaMemcpy( a, da, memsize, cudaMemcpyDeviceToHost );
cudaFree( da ); cudaFree( db ); cudaFree( dc );

35

Page 36

Device-side CUDA C GPU Code

extern "C" __global__ void
vaddkernel( float* a, float* b, float* c, int n )
{
    int ti = threadIdx.x;                 /* local index  */
    int i = blockIdx.x*blockDim.x + ti;   /* global index */
    for( ; i < n; i += blockDim.x*gridDim.x )
        a[i] = b[i] + c[i];
}

36

Page 37

CUDA Behind the Scenes

• What you write is what you get
• Implicitly parallel
  • threads into warps
  • warps into thread groups
  • thread groups into a grid
• Hardware thread scheduler
• Highly multithreaded

37

Page 38

Appropriate Accelerator / GPU programs

• Characterized by nested parallel loops
• High compute intensity
• Regular data access
• Isolated host/GPU data movement

38

Page 39

Device-side OpenCL GPU Code

__kernel void
vaddkernel( __global float* a, __global float* b,
            __global float* c, int n )
{
    int i = get_global_id(0);
    for( ; i < n; i += get_global_size(0) )
        a[i] = b[i] + c[i];
}

39

Page 40

Host-side OpenCL Control Code

hContext = clCreateContextFromType( 0, CL_DEVICE_TYPE_GPU, 0, 0, 0 );
clGetContextInfo( hContext, CL_CONTEXT_DEVICES, sizeof(Dev), Dev, 0 );
hQueue = clCreateCommandQueue( hContext, Dev[0], 0, 0 );
hProgram = clCreateProgramWithSource( hContext, 1, sProgram, 0, 0 );
clBuildProgram( hProgram, 0, 0, 0, 0, 0 );
hKernel = clCreateKernel( hProgram, "vaddkernel", 0 );
ha = clCreateBuffer( hContext, 0, memsize, 0, 0 );
hb = clCreateBuffer( hContext, 0, memsize, 0, 0 );
hc = clCreateBuffer( hContext, 0, memsize, 0, 0 );
clEnqueueWriteBuffer( hQueue, hb, 0, 0, memsize, b, 0, 0, 0 );
clEnqueueWriteBuffer( hQueue, hc, 0, 0, memsize, c, 0, 0, 0 );
clSetKernelArg( hKernel, 0, sizeof(cl_mem), (void*)&ha );
clSetKernelArg( hKernel, 1, sizeof(cl_mem), (void*)&hb );
clSetKernelArg( hKernel, 2, sizeof(cl_mem), (void*)&hc );
clSetKernelArg( hKernel, 3, sizeof(int), (void*)&n );
clEnqueueNDRangeKernel( hQueue, hKernel, 1, 0, &n, &bsize, 0, 0, 0 );
clEnqueueReadBuffer( hQueue, ha, CL_TRUE, 0, memsize, a, 0, 0, 0 );
clReleaseMemObject( hc );
clReleaseMemObject( hb );
clReleaseMemObject( ha );

40

Page 41

Host-side CUDA Fortran Control Code

use vaddmod
real, device, dimension(:), allocatable :: da, db, dc
allocate( da(1:n), db(1:n), dc(1:n) )
db = b
dc = c
call vaddkernel<<<min((n+255)/256,65535),256>>>( da, db, dc, n )
a = da
deallocate( da, db, dc )

41

Page 42

Device-side CUDA Fortran GPU Code

module vaddmod
contains
  attributes(global) subroutine vaddkernel( a, b, c, n )
    real, dimension(*) :: a, b, c
    integer, value :: n
    integer :: i
    i = (blockidx%x-1)*blockdim%x + threadidx%x
    do i = i, n, blockdim%x*griddim%x
      a(i) = b(i) + c(i)
    enddo
  end subroutine
end module

42

Page 43

Outline

• PGI
• Introduction
  • CPU Architecture
  • Parallel Programming
  • Accelerator Architectures
  • Accelerator Programming
• Beginning OpenACC
• Intermediate OpenACC
• Advanced Topics

Page 44

OpenACC Programming

• Simple introductory program
• Programming model
• Building OpenACC programs
• High-level Programming
• Interpreting Compiler Feedback
• Tips and Techniques
• Performance Tuning

44

Page 45

OpenACC vector add

#pragma acc kernels loop
for( i = 0; i < n; ++i )
    a[i] = b[i] + c[i];

45

Page 46

Why use OpenACC Directives?

• Productivity
  • higher-level programming model, a la OpenMP
• Portability
  • ignore directives, portable to the host
  • portable to other accelerators
  • performance portability
• Performance feedback

46

Page 47

Basic Syntactic Concepts

• Fortran OpenACC directive syntax
  • !$acc directive [clause]...
  • & continuation
  • Fortran-77 syntax rules
    • !$acc or C$acc or *$acc in columns 1-5
    • continuation with nonblank in column 6
• C OpenACC directive syntax
  • #pragma acc directive [clause]... eol
  • continue to next line with backslash

47
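As a concrete illustration of the C form, here is a directive continued onto a second line with a backslash; the clause values are arbitrary and only for illustration:

    #pragma acc parallel loop gang vector \
                vector_length(128)
    for( i = 0; i < n; ++i )
        a[i] = b[i] + c[i];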

Page 48

Construct and Region

• Construct is a single-entry/single-exit block of code
  • in Fortran, delimited by begin/end directives
  • in C, a single statement, or loop, or {...} block
  • no jumps into/out of block, no return
  • acc parallel and acc kernels compute constructs
• Region is the dynamic range of a construct
  • may include procedure calls (in future)
  • one construct creates many regions, one for each instance
• Compute construct contains loops to send to GPU
  • loop iterations translated to GPU threads
  • loop indices become threadidx/blockidx indices
• Data region encloses compute regions
  • data moved at construct boundaries
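A minimal sketch of this nesting, assuming placeholder arrays x and y of length n and the OpenACC [start:length] subarray form:

    #pragma acc data copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc kernels loop        /* compute construct inside the data construct */
        for( i = 0; i < n; ++i )        /* iterations become GPU threads */
            y[i] = 2.0f*x[i];
    }                                   /* data moves at the data construct boundaries */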

Page 49

Appropriate Algorithm

• Nested parallel loops
  • iterations map to threads
  • parallelism means threads are independent
  • nested loops means lots of parallelism
• Regular array indexing
  • allows for stride-1 array access

49
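For instance, a doubly nested loop over hypothetical 2-D arrays x, y, z fits this pattern, with the inner loop walking memory at stride 1:

    #pragma acc kernels loop
    for( j = 0; j < m; ++j )
        for( i = 0; i < n; ++i )        /* stride-1 access in the i dimension */
            x[j][i] = y[j][i] + z[j][i];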

Page 50

#pragma acc data create(newa[0:n-1][0:m-1]) \
                 copy(a[0:n-1][0:m-1])
{
  do{
    change = 0;
    #pragma acc kernels
    {
      for( i = 1; i < m-1; ++i )
        for( j = 1; j < n-1; ++j ){
          newa[j][i] = w0*a[j][i] +
              w1 * (a[j][i-1] + a[j-1][i] + a[j][i+1] + a[j+1][i]) +
              w2 * (a[j-1][i-1] + a[j+1][i-1] + a[j-1][i+1] + a[j+1][i+1]);
          change = fmax(change,fabs(newa[j][i]-a[j][i]));
        }
    }
    tmp = newa; newa = a; a = tmp;
  }while( change > tolerance );
}

50

Page 51

Behind the Scenes

• compiler determines parallelism
• compiler generates thread code
  • split up the iterations into threads, thread groups
  • inserts code to use the software data cache
  • accumulate partial sum
  • second kernel to combine final sum
• compiler also inserts data movement
  • compiler or user determines what data to move
  • data moved at boundaries of data/compute region

51
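The partial-sum handling described above is what the compiler generates for the max-reduction of change in the previous example; the same reduction can also be requested explicitly with a loop reduction clause, as a sketch (names follow the code on page 50):

    #pragma acc kernels loop reduction(max:change)
    for( i = 1; i < m-1; ++i )
        for( j = 1; j < n-1; ++j ){
            newa[j][i] = w0*a[j][i] +
                w1 * (a[j][i-1] + a[j-1][i] + a[j][i+1] + a[j+1][i]) +
                w2 * (a[j-1][i-1] + a[j+1][i-1] + a[j-1][i+1] + a[j+1][i+1]);
            change = fmax(change,fabs(newa[j][i]-a[j][i]));
        }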

Page 52

Building Accelerator Programs

• pgfortran -acc a.f90
• pgcc -acc a.c
• Other options:
  • -ta=nvidia[,cc1x|cc2x|cc3x]
    • default in siterc file: set COMPUTECAP=20;
  • -ta=nvidia[,cuda4.2|cuda5.0]
    • default in siterc file: set DEFCUDAVERSION=5.0;
  • -ta=nvidia,host  [default with -acc]
• Enable compiler feedback with -Minfo or -Minfo=accel (RECOMMENDED)

52

Page 53

Performance Goals

• Data movement between Host and Accelerator
  • minimize amount of data
  • minimize number of data moves
  • minimize frequency of data moves
  • optimize data allocation in device memory
• Parallelism on Accelerator
  • lots of MIMD parallelism to fill the multiprocessors
  • lots of SIMD parallelism to fill cores on a multiprocessor
  • lots more MIMD parallelism to fill multithreading parallelism

53
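A common way to meet the data-movement goals above is to hoist a data construct around an iterative loop so arrays move once instead of once per iteration; a minimal sketch with placeholder names (x, nsteps) and an arbitrary per-step kernel:

    #pragma acc data copy(x[0:n])             /* x moves once, not once per step */
    for( step = 0; step < nsteps; ++step ){
        #pragma acc kernels loop present(x[0:n])
        for( i = 0; i < n; ++i )
            x[i] = 0.5f*x[i];                 /* stand-in for the real per-step work */
    }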

Page 54

Performance Goals

• Data movement between device memory and cores
  • minimize frequency of data movement
  • optimize strides: stride-1 in the vector dimension
  • optimize alignment: 16-word aligned in the vector dimension
  • store array blocks in the data cache
• Other goals?
  • minimize register usage?
  • small kernels vs. large kernels?
  • minimize instruction count?
  • minimize synchronization points?

54

Page 55

Program Execution Model

• Host
  • executes most of the program
  • allocates accelerator memory
  • initiates data copy from host memory to accelerator
  • sends kernel code to accelerator
  • queues kernels for execution on accelerator
  • waits for kernel completion
  • initiates data copy from accelerator to host memory
  • deallocates accelerator memory

55

Page 56

Program Execution Model

• Accelerator
  • executes kernels and data transfers, one after another or concurrently/asynchronously with async queues
  • three levels of parallelism
    • gang: across "cores", CUDA grid
    • worker: multiple workers within a gang, CUDA warp
    • vector: vector lanes within a worker, CUDA thread
  • think of gang+worker, or gang+vector

56
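A hedged sketch of how the gang and vector levels are typically spelled on nested loops (the loop bounds and flat indexing are illustrative only):

    #pragma acc parallel loop gang            /* gangs across the outer loop */
    for( j = 0; j < m; ++j ){
        #pragma acc loop vector               /* vector lanes across the inner loop */
        for( i = 0; i < n; ++i )
            a[j*n+i] = b[j*n+i] + c[j*n+i];
    }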

Page 57

Accelerator Gangs

• Any gang may execute to completion before another begins
• No barrier across gangs!
• Analogous to, but very unlike, OpenMP threads

57

Page 58

PGI Accelerator Compilers

subroutine saxpy(a,x,y,n)
 integer :: n
 real :: a, x(n), y(n)
!$acc region
 do i = 1, n
   x(i) = a*x(i) + y(i)
 enddo
!$acc end region
end

Host x64 asm file:

saxpy_:
    . . .
    movl   $.STATICS1+720,%edi
    movl   $5, %edx
    call   __pgi_uacc_begin
    . . .
    call   __pgi_uacc_enter
    . . .
    call   __pgi_uacc_dataon
    . . .
    call   __pgi_uacc_dataon
    . . .
    call   __pgi_uacc_launch
    . . .
    call   __pgi_uacc_dataoff
    . . .
    call   __pgi_uacc_dataoff
    . . .

Auto-generated GPU code:

#include "cuda_runtime.h"
#include "pgi_cuda_runtime.h"
static __constant__ struct{
    int tc1; char* p1; char* p3; float x2;
}a2;
extern "C" __global__ void saxpy_6_gpu(){
    int i1, i1s;
    i1s = (blockIdx.x)*256;
    if( i1s >= a2.tc1 ) goto BB5;
BB3:
    i1 = threadIdx.x + i1s;
    if( i1 >= a2.tc1 ) goto BB4;
    *(float*)(a2.p3+i1*4) = *(float*)(a2.p1+i1*4) +
        a2.x2 * *(float*)(a2.p3+i1*4);
BB4:
    i1s = i1s + gridDim.x*256;
    if( i1s < a2.tc1 ) goto BB3;
BB5: ;
}

Host x64 asm file + auto-generated GPU code = unified a.out:
compile, link, execute with no change to existing makefiles, scripts, IDEs,
programming environment, etc.

58

Page 59

Data Construct

• C
  #pragma acc data
  { .... }

• Fortran
  !$acc data
  ....
  !$acc end data

• May be nested and may contain compute regions
• May not be nested within a compute region

Page 60

Data Clauses

• copy( list )
• copyin( list )
• copyout( list )
• create( list )
• present( list )
• present_or_copy( list )      pcopy( list )
• present_or_copyin( list )    pcopyin( list )
• present_or_copyout( list )   pcopyout( list )
• present_or_create( list )    pcreate( list )
• deviceptr( list )

60
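An illustrative combination of a few of these clauses on one data construct; the arrays and the scratch array tmp are placeholders:

    #pragma acc data copyin(b[0:n], c[0:n]) copyout(a[0:n]) create(tmp[0:n])
    {
        #pragma acc kernels loop
        for( i = 0; i < n; ++i )
            tmp[i] = b[i] + c[i];       /* tmp lives only in device memory */
        #pragma acc kernels loop
        for( i = 0; i < n; ++i )
            a[i] = 2.0f*tmp[i];         /* only a is copied back at the closing brace */
    }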

Page 61

Update Directives

• update host( list )    (also: update self( list ))
• update device( list )
  • data must be in a data allocate clause for an enclosing data region
  • both may be on a single line: update host(list) device(list)
• Data update clauses on the data construct (PGI)
  • updatein( list ) or update device( list )
  • updateout( list ) or update host( list )
  • shorthand for an update directive just inside the data construct

61
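A sketch of the common pattern, assuming a data region around a placeholder array x and a hypothetical host-only routine rescale_on_host:

    #pragma acc data copy(x[0:n])
    {
        #pragma acc kernels loop
        for( i = 0; i < n; ++i ) x[i] = 2.0f*x[i];

        #pragma acc update host(x[0:n])        /* bring current values back mid-region */
        rescale_on_host(x, n);                 /* hypothetical host-only step */
        #pragma acc update device(x[0:n])      /* push the host changes to the device */

        #pragma acc kernels loop
        for( i = 0; i < n; ++i ) x[i] = x[i] + 1.0f;
    }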

Page 62

Kernels Construct

• C
  #pragma acc kernels
  {
      for( i = 0; i < n; ++i )
          r[i] = a[i]*2.0f;
  }

• Fortran
  !$acc kernels
  do i = 1,n
     r(i) = a(i) * 2.0
  enddo
  !$acc end kernels

3-62

Page 63

Kernels Loop Construct

• C
  #pragma acc kernels loop
  for( i = 0; i < n; ++i )
      r[i] = a[i]*2.0f;

• Fortran
  !$acc kernels loop
  do i = 1,n
     r(i) = a(i) * 2.0
  enddo

3-63

Page 64

Kernels Construct Clauses

• Data clauses (and data update clauses (PGI))
• async( int-expr )
• if( expr )
• Loop clauses for kernels
  • gang [( int-expr )]
  • worker [( int-expr )]
  • vector [( int-expr )]
  • seq
  • auto
  • private( var-list )
  • reduction( operator: var-list )
  • independent

64
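An illustrative (not prescriptive) combination of these clauses; the queue number, vector width, and size threshold are arbitrary:

    #pragma acc kernels loop independent vector(128) async(1) if(n > 10000)
    for( i = 0; i < n; ++i )
        r[i] = a[i]*2.0f;
    #pragma acc wait(1)     /* wait on the async queue before using r on the host */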

Page 65

Parallel Construct

• C
  #pragma acc parallel num_gangs(100) \
                       vector_length(128)
  {
      #pragma acc loop gang vector
      for( i = 0; i < n; ++i )
          r[i] = a[i]*2.0f;
  }

• Fortran
  !$acc parallel num_gangs(100) &
  !$acc&         vector_length(128)
  !$acc loop gang vector
  do i = 1,n
     r(i) = a(i) * 2.0
  enddo
  !$acc end parallel

Page 66

Parallel Loop Construct

• C
  #pragma acc parallel loop \
          num_gangs(100) vector_length(128) \
          gang vector
  for( i = 0; i < n; ++i )
      r[i] = a[i]*2.0f;

• Fortran
  !$acc parallel loop &
  !$acc&  num_gangs(100) vector_length(128) &
  !$acc&  gang vector
  do i = 1,n
     r(i) = a(i) * 2.0
  enddo

3-66

Page 67

Parallel Construct Clauses

• Data clauses (and data update clauses (PGI))
• async( int-expr )
• num_gangs( int-expr )
• num_workers( int-expr )
• vector_length( int-expr )
• if( expr )
• private( var-list ), firstprivate( var-list )
• reduction( operator : var-list )
• Loop clauses for parallel loop
  • gang, worker, vector, seq, auto
  • private( var-list )
  • reduction( operator : var-list )

67
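A hedged sketch of a reduction on a combined parallel loop; the gang count, vector length, and dot-product workload are arbitrary illustrations:

    float sum = 0.0f;
    #pragma acc parallel loop num_gangs(100) vector_length(128) \
            gang vector reduction(+:sum)
    for( i = 0; i < n; ++i )
        sum += a[i]*b[i];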

Page 68

Loop Directive

• C
  #pragma acc loop clause...
  for( i = 0; i < n; ++i ){
      ....
  }

• Fortran
  !$acc loop clause...
  do i = 1, n
     ....
  enddo

68

Page 69

Performance Profiling

• TAU (Tuning and Analysis Utilities, University of Oregon)
  • collects performance information
• cudaprof (NVIDIA)
  • gives a trace of kernel execution
• PGI_ACC_TIME=1
  • dump of region-level and kernel-level performance
  • upload/download data movement time
  • kernel execution time
• pgcollect a.out ; pgprof -exe a.out pgprof.out
• PGI_ACC_NOTIFY=1 or 3 or 7

69

Page 70

Profiling an Accelerator Model Program

$ make
pgf90 -fast -Minfo=ccff ../mm2.f90 ../mmdriv.f90 -o mm2 -ta=nvidia
../mm2.f90:
../mmdriv.f90:
$ pgcollect -time ./mm2 < ../in
[...program output...]
target process has terminated, writing profile data
$ pgprof -exe ./mm2

• Build as usual; here we add -Minfo=ccff to enhance profile data
• Just invoke with pgcollect; currently collects data similar to PGI_ACC_TIME
• Invoke the PGPROF performance profiler

70

Page 71

Performance Profiling

Accelerator Kernel Timing data
f3.f90
  smooth
    24: region entered 1 time
        time(us): total=1116701 init=1115986 region=715
                  kernels=22 data=693
        w/o init: total=715 max=715 min=715 avg=715
        27: kernel launched 5 times
            grid: [7x7]  block: [16x16]
            time(us): total=17 max=10 min=1 avg=3
        34: kernel launched 5 times
            grid: [7x7]  block: [16x16]
            time(us): total=5 max=1 min=1 avg=1

Page 72

Accelerator Model Profiling

72

Page 73

Accelerator Profiling - Region

73

Page 74

Accelerator Profiling - Kernel

74

Page 75

Compiler Feedback

75

Page 76

OpenACC vs CUDA/OpenCL

• OpenACC is a high-level implicit programming model for x64+GPU systems, similar to OpenMP for multi-core x64. The OpenACC model:
  • enables offloading of compute-intensive loops and code regions from a host CPU to an accelerator using simple compiler directives
  • implements directives as Fortran comments and C pragmas, so programs can remain 100% standard-compliant and portable
  • makes accelerator programming and optimization incremental and accessible to application domain experts

76

Page 77

OpenACC vs CUDA/OpenCL

• CUDA is a lower-level explicit programming model with substantial runtime library components that give expert programmers direct control of:
  • splitting up of source code into host CPU code and GPU compute kernel code in appropriately defined functions and subprograms
  • allocation of page-locked host memory, GPU device main memory, GPU constant memory and GPU shared memory
  • all data movement between host main memory and the various types of GPU memory
  • definition of thread/block grids and launching of compute kernels
  • synchronization of threads within a CUDA thread group
  • asynchronous launch of GPU compute kernels, synchronization with host CPU
• OpenCL is similar to CUDA, adds some functionality, and is even lower level

Page 78

Where to get help

• PGI Customer Support: [email protected]
• PGI User's Forum: http://www.pgroup.com/userforum/index.php
• PGI Articles: http://www.pgroup.com/resources/articles.htm and http://www.pgroup.com/resources/accel.htm
• PGI User's Guide: http://www.pgroup.com/doc/pgiug.pdf
• CUDA Fortran Reference Guide: http://www.pgroup.com/doc/pgicudafortug.pdf

Page 79

Copyright Notice

© Contents copyright 2009-2013, The Portland Group, Inc. This material may not be reproduced in any manner without the expressed written permission of The Portland Group.

79