
Compilers & Tools for HPC


Page 1: Compilers & Tools for HPC

Compilers & Tools for HPC, January 2014

www.pgroup.com

Page 2: Compilers & Tools for HPC


Who is PGI?

The Portland Group (PGI)
Founded in 1989 – 20+ years in the HPC business
Acquisition by NVIDIA closed on July 29, 2013

Compilers & SW Development Tools
PGI-developed Compilers, Debugger, Performance Profiler
3rd party Integrated Development Environments (IDEs)

HPC & Technical Computing Market
Nat’l Labs, Research Labs, Research Universities, Oil & Gas, Aerospace, Pharmaceutical, …

Page 3: Compilers & Tools for HPC


PGI Installations

PGI has over 25,000 users at over 5,000 sites worldwide

Page 4: Compilers & Tools for HPC


NVIDIA and PGI

World-class HPC companies need world-class compiler teams

NVIDIA + PGI = Integrated HPC system compilers

As part of NVIDIA, PGI will:

Continue to create world-class HPC Fortran/C/C++ compilers for CPU+Accelerator systems
Accelerate development and propagation of OpenACC and CUDA Fortran
Increase velocity on Accelerator-enablement of HPC applications

Page 5: Compilers & Tools for HPC

www.pgroup.com

C99, C++, Fortran 2003 Compilers
Optimizing, Vectorizing, Parallelizing

Graphical Parallel Tools
PGDBG® debugger
PGPROF® profiler

AMD, Intel, NVIDIA Processors
PGI Unified Binary® technology
Performance portability

Linux, OS X, Windows
Visual Studio & Eclipse integration

PGI Accelerator Features

OpenACC C/C++/Fortran compilers
CUDA Fortran compiler
CUDA-x86

Page 6: Compilers & Tools for HPC


Page 7: Compilers & Tools for HPC


The New HPC Node Architecture

Page 8: Compilers & Tools for HPC


PGI OpenACC Compilers

Page 9: Compilers & Tools for HPC


OpenACC: Open Programming Standard for Parallel Computing

“PGI OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan.”

--Buddy Bland, Titan Project Director, Oak Ridge National Lab

“OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP.”

--Michael Wong, CEO OpenMP Directives Board

OpenACC Members

Page 10: Compilers & Tools for HPC


OpenACC Directives

Program myscience
   ... serial code ...
!$acc kernels
   do k = 1,n1
      do i = 1,n2
         ... parallel code ...
      enddo
   enddo
!$acc end kernels
   ...
End Program myscience


OpenACC Compiler Directives

Portable compiler hints

Compiler parallelizes code

Designed for multicore CPUs & many-core GPUs/Accelerators
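As a rough sketch of the build step for the directive program above (the source file name is hypothetical; -acc and -Minfo=accel are standard PGI flags), no code restructuring is needed:

% pgfortran -acc -Minfo=accel myscience.f90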

Page 11: Compilers & Tools for HPC


How Do OpenACC Compilers Work?

#pragma acc kernels loop
for( i = 0; i < nrows; ++i ){
    float val = 0.0f;
    for( d = 0; d < nzeros; ++d ){
        j = i + offset[d];
        if( j >= 0 && j < nrows )
            val += m[i+nrows*d] * v[j];
    }
    x[i] = val;
}

matvec:
    subq $328, %rsp
    ...
    call __pgi_cu_alloc
    ...
    call __pgi_cu_uploadx
    ...
    call __pgi_cu_launch2
    ...
    call __pgi_cu_downloadx
    ...
    call __pgi_cu_free
    ...

.entry matvec_14_gpu(
    ...
    .reg .u32 %r<70>
    ...
    cvt.s32.u32 %r1, %tid.x;
    mov.s32 %r2, 0;
    setp.ne.s32 $p1, %r1, %r2;
    cvt.s32.u32 %r3, %ctaid.x;
    cvt.s32.u32 %r4, %ntid.x;
    mul.lo.s32 %r5, %r3, %r4;
    @%p1 bra $Lt_0_258;
    st.shared.s32 [__i2s], %r5;
$Lt_0_258:
    bar.sync 0;
    ...

[Diagram: the compiler emits x86 asm code plus GPU asm code, linked into a single unified object; execute with no change to existing makefiles, scripts, IDEs, programming environment, etc.]

Page 12: Compilers & Tools for HPC


#pragma acc data \
    copy(b[0:n][0:m]) \
    create(a[0:n][0:m])
{
    for (iter = 1; iter <= p; ++iter){
        #pragma acc kernels
        {
            for (i = 1; i < n-1; ++i){
                for (j = 1; j < m-1; ++j){
                    a[i][j] = w0*b[i][j] +
                        w1*(b[i-1][j]+b[i+1][j]+
                            b[i][j-1]+b[i][j+1]) +
                        w2*(b[i-1][j-1]+b[i-1][j+1]+
                            b[i+1][j-1]+b[i+1][j+1]);
                }
            }
            for( i = 1; i < n-1; ++i )
                for( j = 1; j < m-1; ++j )
                    b[i][j] = a[i][j];
        }
    }
}


OpenACC Coding Example

[Diagram: Host Memory vs. GPU Memory, showing arrays A and B and the successive states S1(B) … Sp(B) of B on the device.]

Page 13: Compilers & Tools for HPC


Matrix Multiply Source Code Size Comparison

OpenACC:

void
computeMM0_saxpy(float C[][WB], float A[][WA], float B[][WB],
                 int hA, int wA, int wB)
{
    #pragma acc region
    {
        #pragma acc for parallel vector(16) unroll(4)
        for (int i = 0; i < hA; ++i) {
            for (int j = 0; j < wB; ++j) {
                C[i][j] = 0.0;
            }
            for (int k = 0; k < wA; ++k) {
                for (int j = 0; j < wB; ++j) {
                    C[i][j] = C[i][j] + A[i][k]*B[k][j];
                }
            }
        }
    }
}

OpenCL:

void matrixMulGPU(cl_uint ciDeviceCount, cl_mem h_A, float* h_B_data,
                  unsigned int mem_size_B, float* h_C)
{
    cl_mem d_A[MAX_GPU_COUNT];
    cl_mem d_C[MAX_GPU_COUNT];
    cl_mem d_B[MAX_GPU_COUNT];

    cl_event GPUDone[MAX_GPU_COUNT];
    cl_event GPUExecution[MAX_GPU_COUNT];

    // Create buffers for each GPU
    // Each GPU will compute sizePerGPU rows of the result
    int sizePerGPU = HA / ciDeviceCount;

    int workOffset[MAX_GPU_COUNT];
    int workSize[MAX_GPU_COUNT];

    workOffset[0] = 0;
    for(unsigned int i=0; i < ciDeviceCount; ++i)
    {
        // Input buffer
        workSize[i] = (i != (ciDeviceCount - 1)) ? sizePerGPU : (HA - workOffset[i]);

        d_A[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY,
                                workSize[i] * sizeof(float) * WA, NULL, NULL);

        // Copy only assigned rows from host to device
        clEnqueueCopyBuffer(commandQueue[i], h_A, d_A[i], workOffset[i] * sizeof(float) * WA,
                            0, workSize[i] * sizeof(float) * WA, 0, NULL, NULL);

        // Create OpenCL buffer on device that will be initialized from the host memory
        // on first use on device
        d_B[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                mem_size_B, h_B_data, NULL);

        // Output buffer
        d_C[i] = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY,
                                workSize[i] * WC * sizeof(float), NULL, NULL);

        // Set the args values
        clSetKernelArg(multiplicationKernel[i], 0, sizeof(cl_mem), (void *) &d_C[i]);
        clSetKernelArg(multiplicationKernel[i], 1, sizeof(cl_mem), (void *) &d_A[i]);
        clSetKernelArg(multiplicationKernel[i], 2, sizeof(cl_mem), (void *) &d_B[i]);
        clSetKernelArg(multiplicationKernel[i], 3, sizeof(float) * BLOCK_SIZE * BLOCK_SIZE, 0);
        clSetKernelArg(multiplicationKernel[i], 4, sizeof(float) * BLOCK_SIZE * BLOCK_SIZE, 0);

        if(i+1 < ciDeviceCount)
            workOffset[i + 1] = workOffset[i] + workSize[i];
    }

    // Execute Multiplication on all GPUs in parallel
    size_t localWorkSize[] = {BLOCK_SIZE, BLOCK_SIZE};
    size_t globalWorkSize[] = {shrRoundUp(BLOCK_SIZE, WC), shrRoundUp(BLOCK_SIZE, workSize[0])};

    // Launch kernels on devices
    for(unsigned int i = 0; i < ciDeviceCount; i++)
    {
        // Multiplication - non-blocking execution
        globalWorkSize[1] = shrRoundUp(BLOCK_SIZE, workSize[i]);
        clEnqueueNDRangeKernel(commandQueue[i], multiplicationKernel[i], 2, 0,
                               globalWorkSize, localWorkSize,
                               0, NULL, &GPUExecution[i]);
    }
    for(unsigned int i = 0; i < ciDeviceCount; i++)
    {
        clFinish(commandQueue[i]);
    }
    for(unsigned int i = 0; i < ciDeviceCount; i++)
    {
        // Non-blocking copy of result from device to host
        clEnqueueReadBuffer(commandQueue[i], d_C[i], CL_FALSE, 0,
                            WC * sizeof(float) * workSize[i],
                            h_C + workOffset[i] * WC, 0, NULL, &GPUDone[i]);
    }

    // CPU sync with GPU
    clWaitForEvents(ciDeviceCount, GPUDone);

    // Release mem and event objects
    for(unsigned int i = 0; i < ciDeviceCount; i++)
    {
        clReleaseMemObject(d_A[i]);
        clReleaseMemObject(d_C[i]);
        clReleaseMemObject(d_B[i]);
        clReleaseEvent(GPUExecution[i]);
        clReleaseEvent(GPUDone[i]);
    }
}

__kernel void
matrixMul( __global float* C, __global float* A, __global float* B,
           __local float* As, __local float* Bs)
{
    int bx = get_group_id(0), tx = get_local_id(0);
    int by = get_group_id(1), ty = get_local_id(1);
    int aEnd = WA * BLOCK_SIZE * by + WA - 1;

    float Csub = 0.0f;

    for (int a = WA*BLOCK_SIZE*by, b = BLOCK_SIZE * bx;
         a <= aEnd; a += BLOCK_SIZE, b += BLOCK_SIZE*WB) {
        As[tx + ty * BLOCK_SIZE] = A[a + WA * ty + tx];
        Bs[tx + ty * BLOCK_SIZE] = B[b + WB * ty + tx];
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[k + ty * BLOCK_SIZE] * Bs[tx + k * BLOCK_SIZE];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = Csub;
}

CUDA C:

__global__ void matrixMul( float* C, float* A, float* B, int wA, int wB)
{
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int aBegin = wA * BLOCK_SIZE * by;
    int aEnd = aBegin + wA - 1;
    int aStep = BLOCK_SIZE;
    int bBegin = BLOCK_SIZE * bx;
    int bStep = BLOCK_SIZE * wB;
    float Csub = 0;
    for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep) {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        AS(ty, tx) = A[a + wA * ty + tx];
        BS(ty, tx) = B[b + wB * ty + tx];
        __syncthreads();
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += AS(ty, k) * BS(k, tx);
        __syncthreads();
    }
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}

void
domatmul( float* C, float* A, float* B, unsigned int hA, unsigned int wA, unsigned wB )
{
    unsigned int size_A = WA * HA;
    unsigned int mem_size_A = sizeof(float) * size_A;
    unsigned int size_B = WB * HB;
    unsigned int mem_size_B = sizeof(float) * size_B;
    unsigned int size_C = WC * HC;
    unsigned int mem_size_C = sizeof(float) * size_C;
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**) &d_A, mem_size_A);
    cudaMalloc((void**) &d_B, mem_size_B);
    cudaMalloc((void**) &d_C, mem_size_C);
    cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, mem_size_B, cudaMemcpyHostToDevice);

    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(WC / threads.x, HC / threads.y);

    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);

    cudaMemcpy(h_C, d_C, mem_size_C, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

Page 14: Compilers & Tools for HPC


Accelerating SEISMIC_CPML from the University of Pau

Read this article online at www.pgroup.com/pginsider

Page 15: Compilers & Tools for HPC


SEISMIC_CPML Timings

Version            MPI Processes   OpenMP Threads   GPUs   Time (sec)   Approx. Programming Time (min)
Original MPI/OMP   2               4                0      951          -
ACC Steps 1/2      2               0                2      3100         10
ACC Step 3         2               0                2      550          60
ACC Step 4         2               0                2      124          120
ACC Step 5         2               0                2      120          120

5x in 5 hours!
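(The five hours is the cumulative approximate programming time in the table: 10 + 60 + 120 + 120 = 310 minutes, just over five hours.)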

System info: 4-core Intel Core i7 920 running at 2.67 GHz, with 2 Tesla C2070 GPUs. Problem size: 101x641x128.

Page 16: Compilers & Tools for HPC


Typical Porting Experience with OpenACC Directives

Page 17: Compilers & Tools for HPC


Cloverleaf mini-App Performance

[Chart: run-time (log scale, 1 to 1000) for the bm_short, bm, bm16_short, and bm16 benchmarks, comparing K20X CUDA, K20X OpenACC, dual-socket CPU C, and dual-socket CPU Fortran versions; lower is better.]

Cloverleaf is a US National Labs Trinity/Coral mini-app benchmark: https://github.com/Warwick-PCAV/CloverLeaf/wiki/Performance-Table

NVIDIA benchmarks: dual-socket Intel Xeon E5-2667

Page 18: Compilers & Tools for HPC


OpenACC: Performance with Less Effort

[Chart: words of code, 0 to 20,000, for the OpenACC, CUDA, and OpenCL versions of Cloverleaf.]

Cloverleaf: http://www.computer.org/csdl/proceedings/sccompanion/2012/4956/00/4956a465-abs.html

Page 19: Compilers & Tools for HPC


Why Use OpenACC Directives?

Productivity
Higher-level programming model
Similar to OpenMP, designed for Accelerated computing

Portability
Ignore directives, code is portable to the host
Portable across different types of Accelerators
Performance portability

Performance Feedback Information
Unique to compilers
Enables incremental porting and tuning
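For illustration, a minimal sketch of what that portability looks like at build time, using PGI flags that appear elsewhere in this deck (the file name jacobi.c is hypothetical):

% pgcc jacobi.c                     directives ignored, host-only binary
% pgcc -acc jacobi.c                same source, built for the accelerator
% pgcc -acc -Minfo=accel jacobi.c   also prints the compiler's feedback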

Page 20: Compilers & Tools for HPC


PGI CUDA Fortran

Page 21: Compilers & Tools for HPC


GPU Code:

attributes(global) subroutine mm_kernel ( A, B, C, N, M, L )
  real :: A(N,M), B(M,L), C(N,L), Cij
  integer, value :: N, M, L
  integer :: i, j, kb, k, tx, ty
  real, shared :: Asub(16,16), Bsub(16,16)
  tx = threadidx%x
  ty = threadidx%y
  i = blockidx%x * 16 + tx
  j = blockidx%y * 16 + ty
  Cij = 0.0
  do kb = 1, M, 16
    Asub(tx,ty) = A(i,kb+tx-1)
    Bsub(tx,ty) = B(kb+ty-1,j)
    call syncthreads()
    do k = 1,16
      Cij = Cij + Asub(tx,k) * Bsub(k,ty)
    enddo
    call syncthreads()
  enddo
  C(i,j) = Cij
end subroutine mm_kernel

Host Code:

real, device, allocatable, dimension(:,:) :: Adev, Bdev, Cdev
. . .
allocate (Adev(N,M), Bdev(M,L), Cdev(N,L))
Adev = A(1:N,1:M)
Bdev = B(1:M,1:L)
call mm_kernel <<<dim3(N/16,M/16),dim3(16,16)>>> ( Adev, Bdev, Cdev, N, M, L )
C(1:N,1:L) = Cdev
deallocate ( Adev, Bdev, Cdev )
. . .

CUDA Fortran in a Nutshell
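A minimal sketch of the corresponding build commands (file names hypothetical; PGI treats .cuf files as CUDA Fortran automatically, and -Mcuda enables it for other extensions):

% pgfortran -c mmul.cuf
% pgfortran -Mcuda -c mmul.f90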

Page 22: Compilers & Tools for HPC


CUDA Fortran / OpenACC Interoperability

module mymod
  real, dimension(:), allocatable, device :: xDev
end module
...
use mymod
...
allocate( xDev(n) )   ! allocates xDev in GPU memory
call init_kernel <<<dim3(n/128),dim3(128)>>> (xDev, n)
...
!$acc data copy( y(:) )   ! no need to copy xDev
...
!$acc kernels loop
do i = 1, n
  y(i) = y(i) + a*xDev(i)
enddo
...
!$acc end data
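The init_kernel launched above is not shown on the slide; a minimal sketch of what it might look like (the zero-fill body is an assumption):

attributes(global) subroutine init_kernel( x, n )
  real :: x(*)
  integer, value :: n
  integer :: i
  ! one thread per element; guard the ragged last block
  i = (blockidx%x-1)*blockdim%x + threadidx%x
  if( i <= n ) x(i) = 0.0
end subroutine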

Page 23: Compilers & Tools for HPC


CUDA Fortran Supports Generic Interfaces and Overloading

use cublas

real(4), device :: xd(N)
real(4) x(N)
call random_number(x)

! Alloc xd in device memory, copy x
! to xd, invoke overloaded isamax
allocate(xd(N))
xd = x
j = isamax(N,xd,1)

! On the host, same name
k = isamax(N,x,1)

module cublas
  ! isamax
  interface isamax
    integer function isamax &
        (n, x, incx)
      integer :: n, incx
      real(4) :: x(*)
    end function

    integer function isamaxcu &
        (n, x, incx) bind(c, &
        name='cublasIsamax')
      integer, value :: n, incx
      real(4), device :: x(*)
    end function
  end interface
  . . .

Page 24: Compilers & Tools for HPC


CUDA Fortran Supports Encapsulation

Isolate device data and accelerator kernel declarations in Fortran modules

module mm
  real, device, allocatable :: a(:)
  real, device :: x, y(10)
  real, constant :: c1, c2(10)
  integer, device :: n
contains
  attributes(global) subroutine s( b )
  ...

Partition source into sections written and maintained by accelerator experts vs. those evolved by science and engineering domain experts.
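For illustration, a minimal hypothetical host-side program using such a module; assignments to the module's device data generate the implicit host/device transfers CUDA Fortran defines for them:

program driver
  use mm              ! device data and kernels come from the module
  implicit none
  real :: yh(10)
  allocate(a(1000))   ! host code allocates the device-resident array
  x = 1.0             ! scalar assignment copies the host value to device x
  y = 0.0             ! whole-array assignment initializes device y
  yh = y              ! assignment copies device y back to the host
  deallocate(a)
end program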

Page 25: Compilers & Tools for HPC


!$CUF Kernel Directives
Simplifies Kernel Creation

module madd_device_module
  use cudafor
contains
  subroutine madd_dev(a,b,c,sum,n1,n2)
    real, dimension(:,:), device :: a,b,c
    real :: sum
    integer :: n1,n2
    type(dim3) :: grid, block
!$cuf kernel do (2) <<<(*,*),(32,4)>>>
    do j = 1,n2
      do i = 1,n1
        a(i,j) = b(i,j) + c(i,j)
        sum = sum + a(i,j)
      enddo
    enddo
  end subroutine
end module

Equivalent hand-written CUDA kernels:

module madd_device_module
  use cudafor
  implicit none
contains
  attributes(global) subroutine madd_kernel(a,b,c,blocksum,n1,n2)
    real, dimension(:,:) :: a,b,c
    real, dimension(:) :: blocksum
    integer, value :: n1,n2
    integer :: i,j,tindex,tneighbor,bindex
    real :: mysum
    real, shared :: bsum(256)
    ! Do this thread's work
    mysum = 0.0
    do j = threadidx%y + (blockidx%y-1)*blockdim%y, n2, blockdim%y*griddim%y
      do i = threadidx%x + (blockidx%x-1)*blockdim%x, n1, blockdim%x*griddim%x
        a(i,j) = b(i,j) + c(i,j)
        mysum = mysum + a(i,j)  ! accumulates partial sum per thread
      enddo
    enddo
    ! Now add up all partial sums for the whole thread block
    ! Compute this thread's linear index in the thread block
    ! We assume 256 threads in the thread block
    tindex = threadidx%x + (threadidx%y-1)*blockdim%x
    ! Store this thread's partial sum in the shared memory block
    bsum(tindex) = mysum
    call syncthreads()
    ! Accumulate all the partial sums for this thread block to a single value
    tneighbor = 128
    do while( tneighbor >= 1 )
      if( tindex <= tneighbor ) &
        bsum(tindex) = bsum(tindex) + bsum(tindex+tneighbor)
      tneighbor = tneighbor / 2
      call syncthreads()
    enddo
    ! Store the partial sum for the thread block
    bindex = blockidx%x + (blockidx%y-1)*griddim%x
    if( tindex == 1 ) blocksum(bindex) = bsum(1)
  end subroutine

  ! Add up partial sums for all thread blocks to a single cumulative sum
  attributes(global) subroutine madd_sum_kernel(blocksum,dsum,nb)
    real, dimension(:) :: blocksum
    real :: dsum
    integer, value :: nb
    real, shared :: bsum(256)
    integer :: tindex,tneighbor,i
    ! Again, we assume 256 threads in the thread block
    ! accumulate a partial sum for each thread
    tindex = threadidx%x
    bsum(tindex) = 0.0
    do i = tindex, nb, blockdim%x
      bsum(tindex) = bsum(tindex) + blocksum(i)
    enddo
    call syncthreads()
    ! This code is copied from the previous kernel
    ! Accumulate all the partial sums for this thread block to a single value
    ! Since there is only one thread block, this single value is the final result
    tneighbor = 128
    do while( tneighbor >= 1 )
      if( tindex <= tneighbor ) &
        bsum(tindex) = bsum(tindex) + bsum(tindex+tneighbor)
      tneighbor = tneighbor / 2
      call syncthreads()
    enddo
    if( tindex == 1 ) dsum = bsum(1)
  end subroutine

  subroutine madd_dev(a,b,c,dsum,n1,n2)
    real, dimension(:,:), device :: a,b,c
    real, device :: dsum
    real, dimension(:), allocatable, device :: blocksum
    integer :: n1,n2,nb
    type(dim3) :: grid, block
    integer :: r
    ! Compute grid/block size; block size must be 256 threads
    grid = dim3((n1+31)/32, (n2+7)/8, 1)
    block = dim3(32,8,1)
    nb = grid%x * grid%y
    allocate(blocksum(1:nb))
    call madd_kernel<<< grid, block >>>(a,b,c,blocksum,n1,n2)
    call madd_sum_kernel<<< 1, 256 >>>(blocksum,dsum,nb)
    r = cudaThreadSynchronize()  ! don't deallocate too early
    deallocate(blocksum)
  end subroutine
end module

Page 26: Compilers & Tools for HPC


2014

Page 27: Compilers & Tools for HPC


PGI 2014 Multi-core x64 Highlights

[Chart: OpenMP performance relative to GCC, scale 150% to 180%.]

Performance Tuning: OpenMP 75% faster than GCC

Comprehensive MPI support

Free PGI for your MacBook

Page 28: Compilers & Tools for HPC


Industry-leading Multi-core x86 Performance

[Chart: SPECompG_base2012 relative performance of PGI 14.1, Intel 14.0.1, and GCC 4.8.1, scale 0% to 200%, on AMD Piledriver (2p/32 cores) and Intel Sandy Bridge (2p/16 cores, hyperthreading enabled).]

SPECompG_base2012 relative performance as measured by The Portland Group during the week of July 29, 2013. The number of OpenMP threads was set to match the number of cores on each system.  SPEComp® is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). 

Page 29: Compilers & Tools for HPC


Comprehensive MPI Support

Open MPI
MVAPICH2
MPICH3
SGI MPI

Debug, Run, Profile

Page 30: Compilers & Tools for HPC


What is in Free PGI for OS X?

OpenMP 3.1 & auto-parallelizing Fortran 2003 and C99 compilers

Optimized for the latest multi-core x86-64 CPUs

Supported on Mountain Lion and Mavericks with Xcode 5

Includes a command-level parallel Fortran debugger

Page 31: Compilers & Tools for HPC


PGI Accelerator 2014 Highlights

Support for NVIDIA Tesla K40 and AMD Radeon GPUs

OpenACC 2.0 Features and Optimizations

CUDA Fortran and OpenACC Debugging

Page 32: Compilers & Tools for HPC


OpenACC Performance Portability

[Chart: average speed-up across 16 benchmarks on Tesla and Radeon; speed-up (0 to 16) by platform for CPU, GPU A, and GPU B.]

CPU results are one core of an Intel Core i7-3930 CPU @ 3.20GHz (Sandy Bridge).

Page 33: Compilers & Tools for HPC


PGI 14.1 OpenACC New Features

Accelerator-side procedure calls using the routine directive
Unstructured data lifetimes using enter data, exit data
Comprehensive host_data support in Fortran
declare create, declare device_resident
Fortran deviceptr data clause
Multidimensional dynamically allocated C/C++ arrays
OpenACC 2.0 API function calls
Calling of CUDA Fortran atomic functions in OpenACC regions
Tuning, tuning, tuning … driven by SPEC ACCEL benchmarks, COSMO, NIM, Cloverleaf, WRF, Gaussian, proprietary customer codes, …
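As a sketch of the multidimensional dynamically allocated C/C++ array support (array names and sizes hypothetical), a pointer-to-pointer array can now appear directly in a data clause:

float **a = (float**)malloc(n * sizeof(float*));
for(int i = 0; i < n; ++i)
    a[i] = (float*)malloc(m * sizeof(float));

#pragma acc data copy(a[0:n][0:m])
{
    #pragma acc parallel loop
    for(int i = 0; i < n; ++i)
        for(int j = 0; j < m; ++j)
            a[i][j] = 0.0f;
}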

Page 34: Compilers & Tools for HPC


OpenACC Procedure Calls

void matvec( ... ){
    float s = 0.0;
    #pragma acc loop vector reduction(+:s)
    for(int j=0; j<n; ++j)
        s += a[i*n+j]*v[j];
    x[i] = s;
}

void test( ... ){
    #pragma acc parallel loop gang ...
    for(int i=0; i<n; ++i)
        matvec(a, v, x, i, n);
}

% pgcc -acc -Minline test.c

#pragma acc routine vector
void matvec(...){
    float s = 0.0;
    #pragma acc loop vector reduction(+:s)
    for(int j=0; j<n; ++j)
        s += a[i*n+j]*v[j];
    x[i] = s;
}

void test(...){
    #pragma acc parallel loop gang ...
    for(int i=0; i<n; ++i)
        matvec(a, v, x, i, n);
}

% pgcc -acc -c matvec.c
% pgcc -acc -c test.c
% pgcc -acc test.o matvec.o

First version: inlining required, single source file. Second version: separate compilation + link step.

Simplifies porting, allows separate compilation and libraries of device-side procedures

Page 35: Compilers & Tools for HPC


OpenACC Unstructured Data Lifetimes

Initialization on host:

alloc(){
    x = (float*)malloc(...);
    ...
}

do_init(){
    alloc();
    for( i=0; i<n; ++i ) x[i] = ...
}

int main(){
    do_init();
    #pragma acc data copy(x[0:n])
    for( time=0; time<n; ++time )
        process( time );
    genoutput();
}

Initialization on device:

alloc(){
    x = (float*)malloc(...);
    ...
    #pragma acc enter data create(x[0:n])
}

do_init(){
    alloc();
    #pragma acc parallel present(x[0:n])
    for( i=0; i<n; ++i ) x[i] = ...
}

int main(){
    do_init();
    #pragma acc data present(x[0:n])
    for( time=0; time<n; ++time )
        process( time );
    #pragma acc exit data copyout(x[0:n])
    genoutput();
}

Enables fine-tuning of data movement through dynamic data management

Page 36: Compilers & Tools for HPC


OpenACC Host_data Directive

cudaMalloc needed:

void cudaproc(...){
    cudakernel<<<n/128,128>>>( a, n );
}
void test(...){
#ifdef CUDA
    cudaMalloc( &a, sizeof(float)*n );
#else
    a = malloc( sizeof(float)*n );
#endif
    ...
    #pragma acc parallel loop deviceptr(a)
    for( i=0; i<n; ++i )
        a[i] = ...
#ifdef CUDA
    cudaproc( a, n );
#else
    hostproc( a, n );
#endif
}

% nvcc -c cproc.cu
% pgcc -DCUDA -acc test.c cproc.o -Mcuda

host_data allows a runtime test:

void cudaproc(...){
    cudakernel<<<n/128,128>>>( a, n );
}
void test(...){
    a = malloc( sizeof(float)*n );
    ...
    #pragma acc data copy(a[0:n])
    {
        #pragma acc parallel loop
        for( i=0; i<n; ++i )
            a[i] = ...
        if( usingcuda ){
            #pragma acc host_data use_device(a)
            cudaproc( a, n );
        }else{
            hostproc( a, n );
        }
    }
}

% nvcc -c cproc.cu
% pgcc -acc test.c cproc.o -Mcuda

Enhances interoperability of CUDA and OpenACC, enables use of GPUDirect

Page 37: Compilers & Tools for HPC


OpenACC Atomic Operations

miniMD contains a race condition:

#pragma acc data copyout(f[0:3*nall]) copyin(x[0:3*nall], numneigh[0:nlocal], neighbors[0:nlocal*maxneighs])
{
    // clear force on own and ghost atoms
    #pragma acc kernels loop
    for(int i = 0; i < nall; i++) {
        ……
    }
    #pragma acc kernels loop independent
    {
        for(int i = 0; i < nlocal; i++) {
            ……
            for(int k = 0; k < numneighs; k++) {
                j = neighs[k];
                ……
                if(GHOST_NEWTON || j < nlocal) {
                    f[3 * j + 0] -= delx * force;
                    f[3 * j + 1] -= dely * force;
                    f[3 * j + 2] -= delz * force;
                }

With OpenACC atomic:

#pragma acc data copyout(f[0:3*nall]) copyin(x[0:3*nall], numneigh[0:nlocal], neighbors[0:nlocal*maxneighs])
{
    // clear force on own and ghost atoms
    #pragma acc kernels loop
    for(int i = 0; i < nall; i++) {
        ……
    }
    #pragma acc kernels loop independent
    {
        for(int i = 0; i < nlocal; i++) {
            ……
            for(int k = 0; k < numneighs; k++) {
                j = neighs[k];
                ……
                if(GHOST_NEWTON || j < nlocal) {
                    #pragma acc atomic update
                    f[3 * j + 0] -= delx * force;
                    #pragma acc atomic update
                    f[3 * j + 1] -= dely * force;
                    #pragma acc atomic update
                    f[3 * j + 2] -= delz * force;
                }

Without support for the OpenACC 2.0 atomic directives this fails and requires a total re-write. That support is currently scheduled for PGI 14.4; in the meantime …

Page 38: Compilers & Tools for HPC


OpenACC Atomic Operations

miniMD contains a race condition:

#pragma acc data copyout(f[0:3*nall]) copyin(x[0:3*nall], numneigh[0:nlocal], neighbors[0:nlocal*maxneighs])
{
    // clear force on own and ghost atoms
    #pragma acc kernels loop
    for(int i = 0; i < nall; i++) {
        ……
    }
    #pragma acc kernels loop independent
    {
        for(int i = 0; i < nlocal; i++) {
            ……
            for(int k = 0; k < numneighs; k++) {
                j = neighs[k];
                ……
                if(GHOST_NEWTON || j < nlocal) {
                    f[3 * j + 0] -= delx * force;
                    f[3 * j + 1] -= dely * force;
                    f[3 * j + 2] -= delz * force;
                }

Calling CUDA atomics in OpenACC regions:

#pragma acc data copyout(f[0:3*nall]) copyin(x[0:3*nall], numneigh[0:nlocal], neighbors[0:nlocal*maxneighs])
{
    // clear force on own and ghost atoms
    #pragma acc kernels loop
    for(int i = 0; i < nall; i++) {
        ……
    }
    #pragma acc kernels loop independent
    {
        for(int i = 0; i < nlocal; i++) {
            ……
            for(int k = 0; k < numneighs; k++) {
                j = neighs[k];
                ……
                if(GHOST_NEWTON || j < nlocal) {
                    //#pragma acc atomic update
                    atomicSubf(f[3 * j + 0], delx * force);
                    //#pragma acc atomic update
                    atomicSubf(f[3 * j + 1], dely * force);
                    //#pragma acc atomic update
                    atomicSubf(f[3 * j + 2], delz * force);
                }
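atomicSubf is not defined on the slide; presumably it wraps CUDA's float atomicAdd with a negated operand. A sketch of such a wrapper on the CUDA side (the name and the reference-style signature are assumptions inferred from the call sites above):

// Hypothetical device-side wrapper: atomically subtract val from x.
// float atomicAdd is available on compute capability 2.0 and later.
__device__ void atomicSubf(float &x, float val)
{
    atomicAdd(&x, -val);
}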

Page 39: Compilers & Tools for HPC


Debugging CUDA Fortran with Allinea DDT

Page 40: Compilers & Tools for HPC


Key Value-add of the PGI Compilers

PGI compilers enable performance-portable programming across CPU+Accelerator systems and minimize re-coding for each new HW advancement.

Page 41: Compilers & Tools for HPC
