
Join the Conversation #OpenPOWERSummit

Porting and Optimizing Applications for AC922 servers using OpenMP and Unified Memory

Leopold Grinberg, IBM/Research, T.J. Watson Center


for (i=0; i < N; ++i)
    data[i] = …

Outline: Hardware, Challenge, Strategy, Value

AC922

42 TF and ~6 TB/s memory BW

Memory: system main memory, GPU’s HBM.

Concurrency: ~1M threads running on the CPUs and GPUs.

Data and memory management; execution policy and expressing parallelism

Designing portable and performance portable code

Unified Addressing; compiler directive based programming

Examples:
• Memory/data management with OpenMP4.5 directives and Unified Addressing
• Nested data structures, std::vector
• Nested parallelism
• Examples from CORAL-1 benchmarks (LULESH, etc.)
• Asynchronous execution

AC922: POWER9 + V100 + NVLink 2.0

V100: 80 SMs; up to 2048 threads per SM; up to 32 CUDA blocks per SM

POWER9: 22 cores; 4 hardware threads per core; NVLink 2.0; PCIe 4

Challenge: Keeping 1M Threads on a Single Node Busy

CPU threads: 22 cores × 4 hardware threads × 2 CPUs = 176. GPU threads: 80 SMs × 2048 threads × 6 GPUs = 983,040.
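The thread counts above can be reproduced with a few lines of compile-time arithmetic. This is only a sketch: the struct `NodeConfig`, the constant `ac922`, and the function names are hypothetical and chosen for illustration; the numbers are the ones quoted on the slides.

```cpp
// Illustrative only: NodeConfig, ac922, cpu_threads and gpu_threads are
// hypothetical names; the values come from the slides.
struct NodeConfig {
    int cpus, cores_per_cpu, hw_threads_per_core;
    int gpus, sms_per_gpu, threads_per_sm;
};

// Hardware threads available on all CPU sockets of one node.
constexpr long cpu_threads(const NodeConfig& n) {
    return 1L * n.cpus * n.cores_per_cpu * n.hw_threads_per_core;
}

// Maximum resident threads on all GPUs of one node.
constexpr long gpu_threads(const NodeConfig& n) {
    return 1L * n.gpus * n.sms_per_gpu * n.threads_per_sm;
}

// 2 POWER9 CPUs (22 cores, 4 threads each) and 6 V100 GPUs (80 SMs, 2048 threads each).
constexpr NodeConfig ac922{2, 22, 4, 6, 80, 2048};

static_assert(cpu_threads(ac922) == 176, "CPU thread count from the slide");
static_assert(gpu_threads(ac922) == 983040, "GPU thread count from the slide");
```

Together this is a little under one million threads per node, which is where the "~1M threads" figure comes from.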

Challenge: Managing Multiple Memories and ~0.5 TB of data

[Figure: node memory layout with six HBM2 memories and two DDR4 memories]

Programming Languages and Compilers on OpenPOWER

Key Features

CUDA
• Direct access to the GPU instruction set
• When leveraging NVIDIA GPUs, generally achieves best performance
• Compilers: XL Fortran, NVCC, PGI CUDA Fortran
• Host compilers: GCC, XL, PGI, CLANG

OpenMP
• High level directives for heterogeneous CPU + NVIDIA GPU systems
• Platform/accelerator portable
• Fallback execution for safety
• Compilers: IBM XL, LLVM/Clang compiler, GCC

OpenACC
• High level directives for heterogeneous CPU + NVIDIA GPU systems
• Directive based parallelization for accelerator device
• Compilers: PGI, GCC

Why use OpenMP 4.x ?

The ultimate goal for developers using OpenMP4.0 and beyond is to achieve:

a) portability

b) performance portability

while using the same source code and compiling it on different platforms.

OpenMP4.5 allows incremental transition of applications: non-threaded codes can first be parallelized using OpenMP directives (if the algorithm allows parallelization), tested on the host (CPU), and then offloaded to the device (GPU).

    for (i=0; i<N; i++)
        y[i] = a*x[i]+y[i];

    #pragma omp parallel for
    for (i=0; i<N; i++)
        y[i] = a*x[i]+y[i];

    #pragma omp target teams distribute parallel for if(0)
    for (i=0; i<N; i++)
        y[i] = a*x[i]+y[i];

    #pragma omp target teams distribute parallel for map(to:x[0:N]) map(tofrom:y[0:N]) if(target:1)
    for (i=0; i<N; i++)
        y[i] = a*x[i]+y[i];
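The incremental path above can be written as three functions that must agree on their results. This is a minimal sketch, not code from any benchmark: the function names and the `use_device` switch are hypothetical. It assumes that a compiler without OpenMP support simply ignores the pragmas and that a runtime without an available device falls back to host execution.

```cpp
#include <vector>

// Step 1: serial reference version.
void axpy_serial(int n, double a, const double* x, double* y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Step 2: host-threaded version, used to validate the parallelization on the CPU.
void axpy_host(int n, double a, const double* x, double* y) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Step 3: offloaded version. use_device (hypothetical switch) selects the
// execution space at run time through the if(target:...) clause.
void axpy_target(int n, double a, const double* x, double* y, bool use_device) {
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n]) if(target: use_device)
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Helper for comparing the three variants on identical input.
std::vector<double> run_variant(int variant, int n) {
    std::vector<double> x(n), y(n);
    for (int i = 0; i < n; ++i) { x[i] = 0.5 * i; y[i] = 1.0; }
    if (variant == 0)      axpy_serial(n, 2.0, x.data(), y.data());
    else if (variant == 1) axpy_host(n, 2.0, x.data(), y.data());
    else                   axpy_target(n, 2.0, x.data(), y.data(), variant == 3);
    return y;
}
```

Comparing `run_variant(0, n)` against the other variants after each step gives a cheap regression check while the code moves from host to device.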

Code | Comments | Effort | No offloading | With offloading | Comment
LULESH | XLC, BW limited | 2-3 days | FOM: 17,000 / node | FOM: 196,000 / node | 27 nodes
AMG2013 | XLC, Read BW limited, cuSparse | < week | FOM: 0.7e+08 / node | FOM: 9.4e+08 / node | 1 node
HPCG** | CLANG, Read BW limited | 3 weeks | FOM: 15.8 | FOM: 197 | 1 node
Quicksilver | “GPU hostile code”, load balancing issues… tracking kernel time | 1 mo. code restructuring, 2 days porting | 35 s | 26.7 s (2 CPUs + 4 GPUs) | 1 node
Opacity library* | Table lookups, integer arithmetic | ~3 weeks | Speedup: 1x | Up to 4x with data transfers, up to 30x with data in GPU | 1 P8 vs. 1 P-100

Speedup callouts on the slide: 12x, 12x, 12x.

Simulations on IBM Minsky nodes (2 POWER8 CPUs and 4 P-100 GPUs).
*Joint work with LLNL and IBM.
**Sequential Gauss-Seidel has been replaced with multi-colored Gauss-Seidel.

Programming with OpenMP 4.5

OPENMP 4.5:
• Managing memory
• Defining execution space
• Managing data
• Nesting parallel regions
• Coarse-grain parallelism
• Fine-grain parallelism
• Intra-node communication
• Inter-node communication

Challenge: managing memory and data with OpenMP4.5

[Figure: node memories, HBM2 and DDR]

Managing memory:
• Memory allocation/deallocation
• Use of memory pools
• Use of Unified Addressing
• Preventing page migration

Managing data:
• Replication of data on different memories + synchronization
• Placing data in buffers provided by memory pools
• Use of Unified Addressing
• [Random] switching execution between HOST and DEVICE requires careful synchronization
• Using [same] data by HOST and DEVICES concurrently and/or in stages

Managing memory and data

• Memory/data management using OpenMP directives and API
• Memory/data management using Unified Addressing
• Memory/data management while mixing OpenMP directives/API and Unified Addressing

Managing memory and data using OpenMP 4.5

Managing memory and data is typically the first task that developers have to tackle when porting applications from homogeneous to heterogeneous systems.

OpenMP4.5 provides a number of options:

1. Use of directives (map, enter data, release, delete, …)

2. Use of OpenMP API calls (omp_target_alloc, omp_target_memcpy, …)

3. Mixing 1 and 2 by allocating memory on the device with omp_target_alloc or CUDA APIs and on the host with malloc, associating pointers using omp_target_associate_ptr, and applying directives (map, update, …)

Managing memory and data using OpenMP 4.5*: data replication model

“…The syntax of the map clause is as follows:
    map([ [map-type-modifier[,]] map-type : ] list)
where map-type is one of the following: …”

    to        allocate* device memory; copy* from CPU to GPU; deallocate* device memory
    from      allocate* device memory; copy* from GPU to CPU; deallocate* device memory
    tofrom    allocate* device memory; copy* from CPU to GPU; {kernel}; copy from GPU to CPU; deallocate* memory
    alloc     allocate* device memory on GPU
    release   reduce reference count; and possibly delete…
    delete    deallocate* device memory

map-type-modifier is always.

*IBM’s implementation

Managing Memory and Data: Example

    double *x, *y;
    int N = 3238289;

    x = new double[N]; y = new double[N];
    for (int i = 0; i < N; ++i)
        x[i] = 0.1*i;

    #pragma omp target teams distribute parallel for map(to:x[0:N]) map(from:y[0:N])
    for (int i = 0; i < N; ++i)
        y[i] = sin(x[i])*cos(x[i]);

[Figure: x[0:N] and y[0:N] shown in both CPU memory and GPU memory]


    x = new double[N]; y = new double[N];
    for (int i = 0; i < N; ++i)
        x[i] = 0.1*i;

    #pragma omp target enter data map(to:x[0:N]) map(alloc:y[0:N])

    #pragma omp target teams distribute parallel for
    for (int i = 0; i < N; ++i)
        y[i] = sin(x[i])*cos(x[i]);

    #pragma omp target exit data map(release:x[0:N]) map(from:y[0:N])

[Figure: x[0:N] and y[0:N] shown in both CPU memory and GPU memory]

The OpenMP runtime will automatically detect whether arrays x and y are mapped.



    x = new double[N]; y = new double[N];
    for (int i = 0; i < N; ++i)
        x[i] = 0.1*i;

    #pragma omp target enter data map(to:x[0:N]) map(alloc:y[0:N])

    #pragma omp target teams distribute parallel for
    for (int i = 0; i < N; ++i)
        y[i] = sin(x[i])*cos(x[i]);

    functionA(x,y,N);

    #pragma omp target exit data map(release:x[0:N]) map(from:y[0:N])

    void functionA(double *x, double *y, int N){
        #pragma omp target teams distribute parallel for map(to:x[0:N]) map(tofrom:y[0:N])
        for (int i = 0; i < N; ++i)
            y[i] = 2.0*x[i] + y[i];
    }

[Figure: x[0:N] and y[0:N] shown in both CPU memory and GPU memory]

Use case for reference counters: since x and y are already mapped, the maps inside functionA are effectively no-ops.
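The reference-counting pattern can be put together in a single translation unit as follows. This is a sketch under stated assumptions: `add_scaled` and `run_pipeline` are hypothetical names, `std::vector` storage stands in for the `new double[N]` arrays, and the code relies on the host fallback when no device is present (or on the pragmas being ignored when OpenMP is disabled).

```cpp
#include <cmath>
#include <vector>

// Library-style routine with its own map clauses. If the caller has already
// mapped x and y, these maps only adjust reference counts and move no data.
void add_scaled(double* x, double* y, int n) {
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = 2.0 * x[i] + y[i];
}

std::vector<double> run_pipeline(int n) {
    std::vector<double> xv(n), yv(n);
    double* x = xv.data();
    double* y = yv.data();
    for (int i = 0; i < n; ++i)
        x[i] = 0.1 * i;

    // Unstructured data region: copy x to the device, allocate y there.
    #pragma omp target enter data map(to: x[0:n]) map(alloc: y[0:n])

    // No map clauses needed: the runtime finds x and y already present.
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i)
        y[i] = std::sin(x[i]) * std::cos(x[i]);

    add_scaled(x, y, n);

    // Release x without a copy; copy y back to the host.
    #pragma omp target exit data map(release: x[0:n]) map(from: y[0:n])
    return yv;
}
```

The same `add_scaled` can also be called on unmapped arrays, in which case its own map clauses perform the allocation and transfers.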

Managing memory and data using OpenMP 4.5: interoperability with NVIDIA libraries

Developers working on codes for simulations on CPUs and GPUs can also mix CUDA API calls and OpenMP4.5 directives/API calls for memory/data management.

For example, in our implementation of the AMG2013 benchmark we use OpenMP4.5 for memory allocation, data initialization, and some kernels, while we use the cuSparse library for GPU-optimized sparse matrix-vector multiplications.

    #pragma omp target enter data map(to: x_data[0:x_size],y_data[0:y_size])
    …
    #pragma omp target data use_device_ptr(x_data,y_data)
    {
        cusparseDcsrmv(cu_spmv_handle, CUSPARSE_OPERATION_NON_TRANSPOSE, num_rows, num_cols, nnz,
                       &alpha, cu_spmv_descr, d_A_data, d_A_i, d_A_j, x_data, &beta, y_data);
    }

Warning: Making code portable to different types of architectures will require additional work! cuSparse is not available on systems without NVIDIA GPUs.

Managing memory/data: deeply nested data structures

qs_vector<MC_Domain> domain
    Class MC_Domain: qs_vec<MC_Cell_state> cell_state; MC_Mesh_Domain mesh
        Class MC_Cell_state: qs_vec<double> _number_dencity; qs_vec<task_precomputed_multigroup_macroscopic_cross_sections_type> _task
            Class task_precomputed_multigroup_macroscopic_cross_sections_type: qs_vector<double> _total
        Class MC_Mesh_Domain: qs_vec<int> _nbrDomainGid; qs_vec<int> _nbrRank; qs_vec<MC_Vector> _node; qs_vec<MC_Facet_Adjacency_Cell> _cellConnectivity; qs_vec<MC_Facet_Geometry_Cell> cell_geometry
            Class MC_Vector {double; double; double}
            Class MC_Facet_Adjacency_Cell: qs_vec<MC_Facet_Adjacency> facet; qs_vec<int> point
                Class MC_Facet_Adjacency: Subfacet_Adjacency subfacet
                    Class Subfacet_Adjacency: MC_Subfacet_Adjacency_Event::Enum event; MC_Location current; MC_Location adjacent
                        Class MC_Location {int; int; int}
            Class MC_Facet_Geometry_Cell: qs_vec<MC_General_Plane> facet
                Class MC_General_Plane {double; double; double; double}


[Figure: structure a {*y, size} and array y[0:N] shown in both CPU memory and GPU memory]

    #pragma omp target enter data map(to:a[0:1])
    #pragma omp target enter data map(to:a->y[0:N])
    #pragma omp target
    {
        a->y[3] += …
    }
    #pragma omp target exit data map(release:a->y[0:N])
    #pragma omp target exit data map(release:a[0:1])

    #pragma omp target data map(to:y[0:N], a[0:1])
    {
        #pragma omp target
        {
            a->y = y;
        }
        #pragma omp target
        {
            a->y[3] += …
        }
    }

Managing memory/data: Unified Addressing

Example: std::vector

    template <class T>
    struct UMAllocator {
        typedef T value_type;
        UMAllocator() {}
        template <class U> UMAllocator(const UMAllocator<U>& other);

        T* allocate(std::size_t n)
        {
            T* ptr;
    #ifdef USE_CUDA_MANAGED
            cudaMallocManaged(&ptr, n*sizeof(T));
    #else
            ptr = (T*) malloc(n*sizeof(T));
    #endif
            return ptr;
        }

        void deallocate(T* p, std::size_t n)
        {
    #ifdef USE_CUDA_MANAGED
            cudaFree(p);
    #else
            free(p);
    #endif
        }
    };

    std::vector<Real_t, UMAllocator<Real_t> > m_dzz;
    …
    m_dzz.resize(numNode);

Managing memory/data: deeply nested data structures

    template <class T>
    struct UMAllocator {
        typedef T value_type;
        UMAllocator() {}
        template <class U> UMAllocator(const UMAllocator<U>& other);

        T* allocate(std::size_t n)
        {
            T* ptr;
            ptr = (T*) malloc(n*sizeof(T));
    #ifdef USE_ATS
            cudaMemPrefetchAsync(ptr,n*sizeof(T),0,0); // required today for performance reasons
            cudaDeviceSynchronize();
    #endif
            return ptr;
        }

        void deallocate(T* p, std::size_t n)
        {
            free(p);
        }
    };

    std::vector<Real_t, UMAllocator<Real_t> > m_dzz;
    …
    m_dzz.resize(numNode);

We expect that in the future prefetching will be handled by the OS, and the CUDA API will not be required.

This version will not work on systems that do not support ATS.
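For reference, a complete allocator that `std::vector` accepts needs a converting constructor with a body and equality operators in addition to `allocate` and `deallocate`. The sketch below is not the allocator used in the benchmarks: `UMAllocatorSketch` and `sum_of_filled_vector` are hypothetical names, error handling via `std::bad_alloc` is an addition, and the CUDA path is only compiled when `USE_CUDA_MANAGED` is defined (and then assumes the CUDA runtime headers and library are available).

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>
#ifdef USE_CUDA_MANAGED
#include <cuda_runtime.h>
#endif

template <class T>
struct UMAllocatorSketch {
    typedef T value_type;
    UMAllocatorSketch() {}
    // Converting constructor required for allocator rebinding.
    template <class U> UMAllocatorSketch(const UMAllocatorSketch<U>&) {}

    T* allocate(std::size_t n) {
        void* ptr = nullptr;
#ifdef USE_CUDA_MANAGED
        // Managed memory: one pointer valid on both host and device.
        if (cudaMallocManaged(&ptr, n * sizeof(T)) != cudaSuccess)
            throw std::bad_alloc();
#else
        // Plain host memory; directly usable by the GPU only on systems with ATS.
        ptr = std::malloc(n * sizeof(T));
        if (!ptr && n != 0)
            throw std::bad_alloc();
#endif
        return static_cast<T*>(ptr);
    }

    void deallocate(T* p, std::size_t) {
#ifdef USE_CUDA_MANAGED
        cudaFree(p);
#else
        std::free(p);
#endif
    }
};

// The allocator is stateless, so all instances compare equal.
template <class T, class U>
bool operator==(const UMAllocatorSketch<T>&, const UMAllocatorSketch<U>&) { return true; }
template <class T, class U>
bool operator!=(const UMAllocatorSketch<T>&, const UMAllocatorSketch<U>&) { return false; }

// Small usage example: fill a vector that uses the allocator and sum it.
double sum_of_filled_vector(std::size_t n, double value) {
    std::vector<double, UMAllocatorSketch<double> > v;
    v.resize(n, value);
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) s += v[i];
    return s;
}
```

Because the allocation policy lives in the allocator type, containers nested inside other containers pick it up without any changes to the code that uses them.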

Managing memory/data on systems with ATS* (address translation service)

*Part of this presentation includes IBM’s extensions for OpenMP4.5 and features of OpenMP5.0 already implemented in compilers supporting OpenMP4.5

Managing memory/data on systems with ATS

    #include <cstdio>
    #include <omp.h>

    int main(){
        int N=20;
        double *data = new double[N];
        omp_set_default_device(0);

        #pragma omp target teams distribute parallel for map(from:data[0:N])
        for (int i = 0; i < N; ++i)
            data[i] = i*0.1;

        for (int i = 0; i < N; i+=4)
            printf("data[%d] = %g\n",i,data[i]);

        delete [] data;
        return 0;
    }

Systems with ATS enabled:

    int main(){
        int N=20;
        double *data = new double[N];

        #pragma omp target teams distribute parallel for is_device_ptr(data)
        for (int i = 0; i < N; ++i)
            data[i] = i*0.1;

        for (int i = 0; i < N; i+=4)
            printf("data[%d] = %g\n",i,data[i]);

        delete [] data;
        return 0;
    }

Systems with ATS enabled and export XLSMPOPTS=TARGETMEM=UIMPLICIT:

    int main(){
        int N=20;
        double *data = new double[N];

        #pragma omp target teams distribute parallel for //is_device_ptr(data)
        for (int i = 0; i < N; ++i)
            data[i] = i*0.1;

        for (int i = 0; i < N; i+=4)
            printf("data[%d] = %g\n",i,data[i]);

        delete [] data;
        return 0;
    }

Systems with ATS enabled and export XLSMPOPTS=TARGETMEM=UIMPLICIT:

    int main(){
        int N=20;
        double *data, *data2;

        data = new double[2];
        data2 = new double[N];

        #pragma omp target teams distribute parallel for map(from:data2[0:N]) //is_device_ptr(data)
        for (int i = 0; i < N; ++i)
            data2[i] = data[i%2] + i*0.1;

        delete [] data;
        delete [] data2;
        return 0;
    }

Here “map” is not ignored: memory for data2 is allocated on the device, and the content of data2 is copied from device to host.

Fortran (contributed by Lixiang Luo, IBM Research)

    subroutine foo(a,n)
      real*8, dimension(n) :: a
      !$omp target teams distribute parallel do
      do i=1,N ; a(i)=0.1*i; end do
    end

    program test_implicit_ats
      integer, parameter :: N=20
      real*8, dimension(:), allocatable :: data
      allocate(data(N))
      call foo(data,N)
      print *,data(10)
    end

    nvprof ./a.out
    1.9520us 160B 78.170MB/s Pinned Device Tesla V100-SXM2 [CUDA memcpy HtoD]
    1.6000us - Tesla V100-SXM2 __xl_foo_l3_OL_1 [156]
    2.0480us 160B 74.506MB/s Device Pinned Tesla V100-SXM2 [CUDA memcpy DtoH]

    export XLSMPOPTS=TARGETMEM=UIMPLICIT
    nvprof ./a.out
    7.1040us - - - Tesla V100-SXM2 __xl_foo_l3_OL_1 [150]

LULESH: performance with ATS and CUDA Managed Memory

    #ifdef USE_ATS
        cudaMemPrefetchAsync(ptr,n*sizeof(T),0,0);
    #else …
        cudaMallocManaged(&ptr, n*sizeof(T));
        cudaMemPrefetchAsync(ptr,n*sizeof(T),0,0);
    #else …

Simulations with 1 MPI rank/GPU

    # MPI ranks | # nodes | FOM/node: CUDA Managed | FOM: ATS
    1000        | 166.7   | 312,782                | 327,581
    1728        | 288     | 308,760                | 328,513

Simulations with 2 MPI ranks/GPU [+MPS]

    # MPI ranks | # nodes | FOM/node: CUDA Managed | FOM: ATS
    1000        | 83.3    | 332,393                | 358,852
    1728        | 144     | 331,248                | 358,538

OpenMP: Nested Parallel regions on CPUs and GPUs


Nested parallelism + concurrent execution on all devices

    void initialize_x_and_y(double *x, double *y, int N, int offset, bool USE_DEVICE);

    int main(){
        double *x, *y;
        double DEVICE_FRACTION = 0;
        int num_devices, i, chunk, j_start, N = 1024*1024*10;
        bool USE_DEVICE;
        x = new double[N]; y = new double[N];
        // enable nested parallelism
        omp_set_nested(1);
        // get number of devices
        num_devices = omp_get_num_devices();
        if (num_devices > 0) DEVICE_FRACTION = 0.9;

        #pragma omp parallel for num_threads(num_devices+1) private(chunk, j_start, USE_DEVICE)
        for (i = 0; i < (num_devices+1); ++i){
            if (i < num_devices){
                // one host thread drives each device
                omp_set_default_device(i);
                chunk = DEVICE_FRACTION * N / num_devices;
                j_start = chunk*i;
                USE_DEVICE = true;
            } else {
                // the last thread processes the remainder on the host
                chunk = N;           // default
                j_start = 0;         // default
                USE_DEVICE = false;  // default
                if (num_devices > 0){
                    j_start = (int)(DEVICE_FRACTION * N / num_devices) * num_devices;
                    chunk = N - j_start;
                }
            }
            initialize_x_and_y(x+j_start, y+j_start, chunk, j_start, USE_DEVICE);
        }
        delete [] x; delete [] y;
        return 0;
    }

    void initialize_x_and_y(double *x, double *y, int N, int offset, bool USE_DEVICE){
        #pragma omp target teams distribute parallel for map(from:x[0:N], y[0:N]) if(target:USE_DEVICE)
        for (int i = 0; i < N; ++i){
            x[i] = (offset + i) * 0.001;
            y[i] = (offset + i) * 0.003;
        }
    }
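The index arithmetic that splits the iteration space between the devices and the host is easy to get wrong, so it can help to isolate it in a function that is testable without any accelerator. The sketch below is an illustration only: `Chunk` and `partition_work` are hypothetical names, and the split rule (equal device chunks, remainder on the host) mirrors the example above.

```cpp
#include <vector>

// One contiguous piece of the iteration space and where it should run.
struct Chunk {
    int start;
    int size;
    bool on_device;
};

// Split n iterations: each device gets an equal share of device_fraction * n,
// and the host receives whatever is left (including rounding remainders).
std::vector<Chunk> partition_work(int n, int num_devices, double device_fraction) {
    std::vector<Chunk> chunks;
    int device_chunk = 0;
    if (num_devices > 0)
        device_chunk = static_cast<int>(device_fraction * n / num_devices);

    for (int d = 0; d < num_devices; ++d)
        chunks.push_back(Chunk{d * device_chunk, device_chunk, true});

    // Using the truncated device_chunk here guarantees no gap and no overlap.
    int host_start = device_chunk * num_devices;
    chunks.push_back(Chunk{host_start, n - host_start, false});
    return chunks;
}
```

Each returned chunk could then be handed to one thread of the outer parallel loop, with `on_device` feeding the `if(target:...)` clause.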

Nested parallelism: communication in LULESH

    #pragma omp parallel sections private(pmsg,emsg,cmsg,destAddr)
    {
        #pragma omp section
        {
            if (planeMin | planeMax) {
                …
                destAddr = &domain.commDataSend[pmsg * maxPlaneComm] ;
                #pragma omp target teams distribute parallel for collapse(2) if(target:USE_DEVICE) is_device_ptr(destAddr) thread_limit(64)
                for (Index_t fi=0 ; fi<xferFields; ++fi) {
                    for (Index_t i=0; i<sendCount; ++i) {
                        destAddr[i+sendCount*fi] = ptr_fi[fi][i] ;
                    }
                }
                MPI_Isend(destAddr, …) ;
            }
        }

        #pragma omp section
        {
            if (rowMin && planeMin && not_planeOnly) {
                …
                destAddr = &domain.commDataSend[pmsg * maxPlaneComm + emsg * maxEdgeComm] ;
                #pragma omp target teams distribute parallel for collapse(2) if(target:USE_DEVICE) is_device_ptr(destAddr) thread_limit(64)
                for (Index_t fi=0; fi<xferFields; ++fi) {
                    for (Index_t i=0; i<dx; ++i) {
                        destAddr[i + dx*fi] = ptr_fi[fi][i] ;
                    }
                }
                MPI_Isend(destAddr, …) ;
            }
        }
        …
    }

Nested parallelism: communication in LULESH

    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0){
            /* evaluate time constraint */
            CalcCourantConstraintForElems(domain,
                domain.regElemSize(r), domain.regElemlist(r),
                domain.qqc(), domain.dtcourant()) ;
        }
        if (omp_get_thread_num() == (omp_get_num_threads() - 1)){
            /* check hydro constraint */
            CalcHydroConstraintForElems(domain,
                domain.regElemSize(r), domain.regElemlist(r),
                domain.dvovmax(), domain.dthydro()) ;
        }
    }

Each of the two functions contains:

    #pragma omp target teams distribute parallel for \
        if(target:USE_DEVICE) map(tofrom:pos) map(from:…)

Asynchronous execution

Asynchronous execution in LULESH

    void CalcEnergyForElems( ….. ){
        …
        #pragma omp target teams distribute parallel for is_device_ptr(compHalfStep,delvc, …q_old) if(target:USE_DEVICE)
        for (Index_t i = 0 ; i < length ; ++i) {
            Real_t vhalf = Real_t(1.) / (Real_t(1.) + compHalfStep[i]) ;
            ……..
        }

        #pragma omp target teams distribute parallel for is_device_ptr(e_new,work) if(target:USE_DEVICE)
        for (Index_t i = 0 ; i < length ; ++i) {
            e_new[i] += Real_t(0.5) * work[i];
            if (FABS(e_new[i]) < e_cut) e_new[i] = Real_t(0.) ;
            if ( e_new[i] < emin ) e_new[i] = emin ;
        }

        CalcPressureForElems(p_new, bvc, pbvc, e_new, compression, vnewc,
                             pmin, p_cut, eosvmax, length, regElemList);

        #pragma omp target teams distribute parallel for is_device_ptr(delvc, … ,regElemList) if(target:USE_DEVICE)
        for (Index_t i = 0 ; i < length ; ++i){
            const Real_t sixth = Real_t(1.0) / Real_t(6.0) ;
            ….
        }

    void CalcPressureForElems(Real_t* p_new, …. )
        #pragma omp target teams … if(target:USE_DEVICE)
        for (Index_t i = 0; i < length ; ++i) {
            Real_t c1s = Real_t(2.0)/Real_t(3.0) ;
            bvc[i] = c1s * (compression[i] + Real_t(1.));
            pbvc[i] = c1s;
        }

        #pragma omp target … if(target:USE_DEVICE)
        for (Index_t i = 0 ; i < length ; ++i){
            Index_t elem = regElemList[i];
            …
        }

    void CalcEnergyForElems( ….. ){
        …
        #pragma omp target teams distribute parallel for is_device_ptr(compHalfStep,delvc, …q_old) nowait depend(inout:dep_flag) if(target:USE_DEVICE)
        for (Index_t i = 0 ; i < length ; ++i) {
            Real_t vhalf = Real_t(1.) / (Real_t(1.) + compHalfStep[i]) ;
            ……..
        }

        #pragma omp target teams distribute parallel for is_device_ptr(e_new,work) nowait depend(inout:dep_flag) if(target:USE_DEVICE)
        for (Index_t i = 0 ; i < length ; ++i) {
            e_new[i] += Real_t(0.5) * work[i];
            if (FABS(e_new[i]) < e_cut) e_new[i] = Real_t(0.) ;
            if ( e_new[i] < emin ) e_new[i] = emin ;
        }

        CalcPressureForElems(p_new, bvc, pbvc, e_new, compression, vnewc,
                             pmin, p_cut, eosvmax, length, regElemList, dep_flag);

        #pragma omp target teams distribute parallel for is_device_ptr(delvc, … ,regElemList) nowait depend(inout:dep_flag) if(target:USE_DEVICE)
        for (Index_t i = 0 ; i < length ; ++i){
            const Real_t sixth = Real_t(1.0) / Real_t(6.0) ;
            ….
        }

    // dep_flag is passed by reference so that all depend clauses name the same storage location
    void CalcPressureForElems(Real_t* p_new, …. int& dep_flag)
        #pragma omp target teams … nowait depend(inout:dep_flag) if(target:USE_DEVICE)
        for (Index_t i = 0; i < length ; ++i) {
            Real_t c1s = Real_t(2.0)/Real_t(3.0) ;
            bvc[i] = c1s * (compression[i] + Real_t(1.));
            pbvc[i] = c1s;
        }

        #pragma omp target … nowait depend(inout:dep_flag) if(target:USE_DEVICE)
        for (Index_t i = 0 ; i < length ; ++i){
            Index_t elem = regElemList[i];
            …
        }
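The `nowait` plus `depend` pattern can be tried in isolation with two small kernels. This is a sketch and not LULESH code: `two_stage_update` and its data are hypothetical, explicit `map` clauses replace `is_device_ptr` so the example does not rely on ATS or managed memory, and a `taskwait` is added because the host must wait for the deferred target tasks before reading the result.

```cpp
#include <vector>

// Launch two dependent kernels without blocking the host between them.
std::vector<double> two_stage_update(int n) {
    std::vector<double> ev(n, 1.0), wv(n, 4.0);
    double* e = ev.data();
    double* w = wv.data();
    int dep_flag = 0;  // only its address matters: it names the dependence chain

    // Stage 1: e += 0.5 * w. The host thread continues immediately (nowait).
    #pragma omp target teams distribute parallel for nowait depend(inout: dep_flag) \
        map(tofrom: e[0:n]) map(to: w[0:n])
    for (int i = 0; i < n; ++i)
        e[i] += 0.5 * w[i];

    // Stage 2: e *= 2. Ordered after stage 1 through depend(inout: dep_flag).
    #pragma omp target teams distribute parallel for nowait depend(inout: dep_flag) \
        map(tofrom: e[0:n])
    for (int i = 0; i < n; ++i)
        e[i] *= 2.0;

    // Host-side work could overlap with the kernels here.

    // Wait for both deferred target tasks before the host reads e.
    #pragma omp taskwait
    return ev;
}
```

Because the dependence is attached to the storage location of `dep_flag`, every kernel in the chain has to name the same variable, which is why helper functions should receive it by reference or pointer.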

[Figure: LULESH (PWR8 + Pascal), FOM (z/s) vs. # of nodes, Synch vs. Asynch. Synch: 163K, 158K, 158K. Asynch: 203K, 197K, 196K. Node counts legible on the axis: 18, 27.]

[implicit] Placing Data in GPU’s Shared Memory

BLK_SZ is known at compile time.

VAL is team private.

Performance: achieved BW is measured as (Nr*Nc*2*8 bytes)/(kernel time). With BLK_SZ=32 we measure ~900 GB/s while using shared memory and ~40 GB/s without …
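The kernel for this slide is not preserved in the text, so the following is only a guess at the general shape of such code: a blocked matrix transpose in which each team stages a BLK_SZ × BLK_SZ tile in a team-private array named VAL. Whether a compiler actually places VAL in GPU shared memory is implementation dependent. The function names are hypothetical, and Nr and Nc are assumed to be multiples of BLK_SZ.

```cpp
#include <vector>

constexpr int BLK_SZ = 32;  // known at compile time

// Transpose an Nr x Nc row-major matrix 'in' into the Nc x Nr matrix 'out'.
void transpose_blocked(const double* in, double* out, int Nr, int Nc) {
    const int nbr = Nr / BLK_SZ;  // number of tile rows
    const int nbc = Nc / BLK_SZ;  // number of tile columns

    #pragma omp target teams distribute collapse(2) \
        map(to: in[0:Nr*Nc]) map(from: out[0:Nr*Nc])
    for (int br = 0; br < nbr; ++br)
        for (int bc = 0; bc < nbc; ++bc) {
            // Declared inside the teams region: private to the team,
            // shared by the threads of the nested parallel regions.
            double VAL[BLK_SZ][BLK_SZ];

            // Load one tile with row-contiguous reads.
            #pragma omp parallel for collapse(2)
            for (int r = 0; r < BLK_SZ; ++r)
                for (int c = 0; c < BLK_SZ; ++c)
                    VAL[r][c] = in[(br * BLK_SZ + r) * Nc + bc * BLK_SZ + c];

            // Store the tile transposed with row-contiguous writes.
            #pragma omp parallel for collapse(2)
            for (int c = 0; c < BLK_SZ; ++c)
                for (int r = 0; r < BLK_SZ; ++r)
                    out[(bc * BLK_SZ + c) * Nr + br * BLK_SZ + r] = VAL[r][c];
        }
}

// Achieved bandwidth in GB/s as defined on the slide:
// every double is read once and written once (2 * 8 bytes per element).
double achieved_bw_GBs(long Nr, long Nc, double kernel_seconds) {
    return (Nr * Nc * 2.0 * 8.0) / kernel_seconds / 1.0e9;
}
```

With this measure, a 1000 × 1000 matrix processed in 16 ms corresponds to 1 GB/s.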

Acknowledgement:

IBM Compiler and OpenMP-runtime team: Ettore Tiotto, Tarique Islam, Bardia Mahjour, Zarko Todorovski, Wael Yehia, Rafik Zurob, Wang Chen, Kelvin Li, Alexandre Eichenberger, George Bercea, Kevin O’Brien

LLNL’s personnel: Riyaz Haque, Tom Scogland
