Join the Conversation #OpenPOWERSummit
Porting and Optimizing Applications for AC922 servers using OpenMP and Unified Memory
Leopold Grinberg, IBM/Research, T.J. Watson Center
for (i=0; i < N; ++i)
data[i] = …
42 TF and ~6 TB/s memory BW

Hardware: AC922
• Memory: system main memory, GPU’s HBM.
• Concurrency: ~1M threads running on the CPUs and GPUs.
Challenge:
• Data and memory management
• Execution policy, and expressing parallelism
• Designing portable and performance portable code
Strategy:
• Unified Addressing
• Compiler directive based programming
Value: examples
• Memory/data management with OpenMP 4.5 directives and Unified Addressing
• Nested data structures, std::vector
• Nested parallelism
• Examples from CORAL-1 benchmarks (LULESH, etc.)
• Asynchronous execution
AC922: POWER9 + V100 + NVLink 2.0
V100: 80 SMs; up to 2048 threads per SM; up to 32 CUDA blocks per SM
POWER9: 22 cores; 4 hardware threads per core; NVLink 2.0; PCIe 4
Challenge: Keeping 1M Threads on a Single Node Busy
22*4*2 = 176 CPU threads + 80*2048*6 = 983,040 GPU threads
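The thread budget above follows directly from the hardware figures quoted on the previous slide. A minimal sketch of that arithmetic, assuming two POWER9 sockets and six V100 GPUs per node (the factors 2 and 6 in the products above); the struct and function names are illustrative only:

```cpp
#include <cstdio>

// Per-node hardware parameters taken from the slide figures.
struct NodeConfig {
    int cpu_sockets;          // POWER9 sockets per node (assumed: 2)
    int cores_per_socket;     // 22 cores per POWER9
    int hw_threads_per_core;  // 4 hardware threads per core
    int gpus;                 // V100 GPUs per node (assumed: 6)
    int sms_per_gpu;          // 80 SMs per V100
    int threads_per_sm;       // up to 2048 resident threads per SM
};

// Number of CPU hardware threads on one node.
constexpr long cpu_threads(const NodeConfig& c) {
    return static_cast<long>(c.cpu_sockets) * c.cores_per_socket * c.hw_threads_per_core;
}

// Upper bound on concurrently resident GPU threads on one node.
constexpr long gpu_threads(const NodeConfig& c) {
    return static_cast<long>(c.gpus) * c.sms_per_gpu * c.threads_per_sm;
}

constexpr NodeConfig ac922{2, 22, 4, 6, 80, 2048};

void print_thread_budget() {
    std::printf("CPU threads: %ld, GPU threads: %ld\n",
                cpu_threads(ac922), gpu_threads(ac922));
}
```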
Challenge: Managing Multiple Memories and ~0.5 TB of data
[Figure: node memory layout with two DDR4 main memories and six HBM2 GPU memories]
Programming Languages and Compilers on OpenPOWER
Key Features
CUDA
• Direct access to the GPU instruction set
• When leveraging NVIDIA GPUs, generally achieves best performance
• Compilers: XL Fortran, NVCC, PGI CUDA Fortran
• Host compilers: GCC, XL, PGI, CLANG
OpenMP
• High level directives for heterogeneous CPU + NVIDIA GPU systems
• Platform/accelerator portable
• Fallback execution for safety
• Compilers: IBM XL, LLVM/Clang compiler, GCC
OpenACC
• High level directives for heterogeneous CPU + NVIDIA GPU systems
• Directive based parallelization for accelerator device
• Compilers: PGI, GCC
Why use OpenMP 4.x ?
The ultimate goal for developers using OpenMP 4.0 and beyond is to achieve:
a) portability
b) performance portability
while using the same source code and compiling it on different platforms.
OpenMP 4.5 allows an incremental transition of applications: non-threaded codes can first be parallelized using OpenMP directives (if the algorithm allows parallelization), tested on the host (CPU), and then offloaded to the device (GPU).

Serial code:
for (i=0; i<N; i++)
  y[i] = a*x[i]+y[i];

Parallelized on the host:
#pragma omp parallel for
for (i=0; i<N; i++)
  y[i] = a*x[i]+y[i];

Target directive, still executed on the host (if(0) disables offloading):
#pragma omp target teams distribute parallel for if(0)
for (i=0; i<N; i++)
  y[i] = a*x[i]+y[i];

Offloaded to the device:
#pragma omp target teams distribute parallel for map(to:x[0:N]) map(tofrom:y[0:N]) if(target:1)
for (i=0; i<N; i++)
  y[i] = a*x[i]+y[i];
Why use OpenMP 4.x ?
Code | Comments | Effort | No offloading | With offloading | Comment
LULESH | XLC, BW limited | 2-3 days | FOM: 17,000 / node | FOM: 196,000 / node | 27 nodes
AMG2013 | XLC, Read BW limited, cuSparse | < week | FOM: 0.7e+08 / node | FOM: 9.4e+08 / node | 1 node
HPCG** | CLANG, Read BW limited | 3 weeks | FOM: 15.8 | FOM: 197 | 1 node
Quicksilver | “GPU hostile code”, load balancing issues… tracking kernel time | 1 mo. code restructuring, 2 days porting | 35 s | 26.7 s (2 CPUs + 4 GPUs) | 1 node
Opacity library* | Table lookups, integer arithmetic | ~3 weeks | Speedup: 1x | Up to 4x with data transfers, up to 30x with data in GPU | 1 P8 vs. 1 P100
Speedup annotations on the slide: 12x, 12x, 12x.
Simulations on IBM Minsky nodes (2 POWER8 CPUs and 4 P100 GPUs).
*Joint work with LLNL and IBM. **Sequential Gauss-Seidel has been replaced with multi-colored Gauss-Seidel.
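The offload speedup for the FOM-based benchmarks is simply the ratio of the "with offloading" figure to the "no offloading" figure. A small sketch of that ratio using the LULESH, AMG2013 and HPCG numbers quoted above; the function names are illustrative, and the ratios land around the 12x annotation on the slide rather than exactly on it:

```cpp
#include <cstdio>

// Speedup of an offloaded run over a host-only run, for a
// figure of merit where larger is better.
double fom_speedup(double fom_with_offload, double fom_without_offload) {
    return fom_with_offload / fom_without_offload;
}

void print_speedups() {
    // FOM values as quoted in the benchmark table.
    std::printf("LULESH : %.1fx\n", fom_speedup(196000.0, 17000.0));
    std::printf("AMG2013: %.1fx\n", fom_speedup(9.4e8, 0.7e8));
    std::printf("HPCG   : %.1fx\n", fom_speedup(197.0, 15.8));
}
```

Quicksilver is reported as run time rather than FOM, so its ratio would be inverted (time without offloading divided by time with offloading).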
Programming with OpenMP 4.5
OpenMP 4.5:
• Managing memory
• Defining execution space
• Managing data
• Nesting parallel regions
• Coarse-grain parallelism
• Fine-grain parallelism
• Intra-node communication
• Inter-node communication
Challenge: managing memory and data with OpenMP 4.5
Managing memory:
• Memory allocation/deallocation
• Use of memory pools
• Use of Unified Addressing
• Preventing page migration
Managing data:
• Replication of data on different memories + synchronization
• Placing data in buffers provided by memory pools
• Use of Unified Addressing
• [Random] switching execution between HOST and DEVICE requires careful synchronization
• Using [same] data by HOST and DEVICES concurrently and/or in stages
Managing memory and data
• Memory/data management using OpenMP directives and API
• Memory/data management using Unified Addressing
• Memory/data management while mixing OpenMP directives/API and Unified Addressing
Managing memory and data using OpenMP 4.5
Managing memory and data is typically the first task that developers have to tackle when porting applications from homogeneous to heterogeneous systems.
OpenMP 4.5 provides a number of options:
1. Use of directives (map, enter data, release, delete, …)
2. Use of OpenMP API calls (omp_target_alloc, omp_target_memcpy, …)
3. Mixing 1 and 2 by allocating memory on the device with omp_target_alloc or CUDA APIs and on the host with malloc, associating pointers using omp_target_associate_ptr, and applying directives (map, update, …)
Managing memory and data using OpenMP 4.5* - data replication model
“…The syntax of the map clause is as follows:
map([ [map-type-modifier[,]] map-type : ] list)
where map-type is one of the following: …”
• to: allocate* device memory; copy* from CPU to GPU; deallocate* device memory
• from: allocate* device memory; copy* from GPU to CPU; deallocate* device memory
• tofrom: allocate* device memory; copy* from CPU to GPU; {kernel}; copy from GPU to CPU; deallocate* memory
• alloc: allocate* device memory on GPU
• release: reduce reference count; and possibly delete…
• delete: deallocate* device memory
map-type-modifier is always.
*IBM’s implementation
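To make the map types concrete, here is a minimal sketch that uses to, from, tofrom and alloc on a single target region. The function and array names are hypothetical and not taken from the benchmarks; when no device is available the region simply runs on the host.

```cpp
#include <cmath>

// x:       read-only input           -> map(to)
// y:       output only               -> map(from)
// acc:     read and updated          -> map(tofrom)
// scratch: device-side temporary     -> map(alloc), never copied
void map_types_demo(const double* x, double* y, double* acc,
                    double* scratch, int n) {
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(from: y[0:n])            \
        map(tofrom: acc[0:n]) map(alloc: scratch[0:n])
    for (int i = 0; i < n; ++i) {
        scratch[i] = std::sin(x[i]);          // written and read only inside the region
        y[i] = scratch[i] * std::cos(x[i]);   // copied back at the end of the region
        acc[i] += y[i];                       // copied in and copied back
    }
}
```

Choosing the narrowest map type for each array avoids transfers that the kernel does not need, which matters for bandwidth-limited codes.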
Managing Memory and Data: Example
double *x, *y;
int N = 3238289;
x = new double[N]; y = new double[N];
for (i = 0; i < N; ++i)
  x[i] = 0.1*i;

#pragma omp target teams distribute parallel for map(to:x[0:N]) map(from:y[0:N])
for (i = 0; i < N; ++i)
  y[i] = sin(x[i])*cos(x[i]);

[Figure: x[0:N] and y[0:N] in CPU memory, with their copies x[0:N] and y[0:N] in GPU memory]
Managing Memory and Data: Example
double *x, *y;
int N = 3238289;
x = new double[N]; y = new double[N];
for (i = 0; i < N; ++i)
  x[i] = 0.1*i;

#pragma omp target enter data map(to:x[0:N]) map(alloc:y[0:N])

#pragma omp target teams distribute parallel for
for (i = 0; i < N; ++i)
  y[i] = sin(x[i])*cos(x[i]);

#pragma omp target exit data map(release:x[0:N]) map(from:y[0:N])

The OpenMP runtime will automatically detect whether arrays x and y are mapped.

[Figure: x[0:N] and y[0:N] in CPU memory, with their copies x[0:N] and y[0:N] in GPU memory]
Managing Memory and Data: Example
double *x, *y;
int N = 3238289;
x = new double[N]; y = new double[N];
for (i = 0; i < N; ++i)
  x[i] = 0.1*i;

#pragma omp target enter data map(to:x[0:N]) map(alloc:y[0:N])

#pragma omp target teams distribute parallel for
for (i = 0; i < N; ++i)
  y[i] = sin(x[i])*cos(x[i]);

functionA(x,y,N);

#pragma omp target exit data map(release:x[0:N]) map(from:y[0:N])

void functionA(double *x, double *y, int N){
#pragma omp target teams distribute parallel for map(to:x[0:N]) map(tofrom:y[0:N])
  for (int i = 0; i < N; ++i)
    y[i] = 2.0*x[i] + y[i];
}

[Figure: x[0:N] and y[0:N] in CPU memory, with their copies x[0:N] and y[0:N] in GPU memory]

Use case for reference counters: since x and y are already mapped, the maps in functionA are effectively no-ops.
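One consequence of reference counting is that a map clause on data that is already present transfers nothing, so the host copy is not refreshed automatically. A minimal sketch of this, assuming a device with its own memory; the function names are hypothetical, and `target update` is the standard OpenMP 4.5 way to force the copy while the data stays mapped:

```cpp
// Doubles y in place. If y is already present on the device, the
// tofrom map only adjusts the reference count and copies nothing.
void scale_on_device(double* y, int n) {
    #pragma omp target teams distribute parallel for map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] *= 2.0;
}

// Maps y, updates it on the device, and reads the first element back
// on the host while y is still mapped.
double first_after_update(double* y, int n) {
    #pragma omp target enter data map(to: y[0:n])
    scale_on_device(y, n);                   // maps inside are no-ops here
    #pragma omp target update from(y[0:n])   // explicit device-to-host copy
    double first = y[0];
    #pragma omp target exit data map(delete: y[0:n])
    return first;
}
```

Without the `target update` line the host would still see the old value of `y[0]` on a system where host and device memories are separate.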
Managing memory and data using OpenMP 4.5: interoperability with NVIDIA libraries
Developers working on codes for simulations on CPUs and GPUs can also mix CUDA API calls and OpenMP 4.5 directives/API calls for memory/data management.
For example, in our implementation of the AMG2013 benchmark we use OpenMP 4.5 for memory allocation, data initialization, and some kernels, while we use the cuSparse library for GPU-optimized sparse matrix-vector multiplications.

#pragma omp target enter data map(to: x_data[0:x_size],y_data[0:y_size])
…
#pragma omp target data use_device_ptr(x_data[0:x_size],y_data[0:y_size])
{
  cusparseDcsrmv(cu_spmv_handle, CUSPARSE_OPERATION_NON_TRANSPOSE, num_rows, num_cols, nnz,
                 &alpha, cu_spmv_descr, d_A_data, d_A_i, d_A_j, x_data, &beta, y_data);
}

Warning: Making code portable to different types of architectures will require additional work! cuSparse is not available on systems without NVIDIA GPUs.
Managing memory/data: deeply nested data structures
qs_vector<MC_Domain> domain
  Class MC_Domain: qs_vec<MC_Cell_state> cell_state; MC_Mesh_Domain mesh
    Class MC_Cell_state: qs_vec<double> _number_dencity; qs_vec<task_precomputed_multigroup_macroscopic_cross_sections_type> _task
      Class task_precomputed_multigroup_macroscopic_cross_sections_type: qs_vector<double> _total
    Class MC_Mesh_Domain: qs_vec<int> _nbrDomainGid; qs_vec<int> _nbrRank; qs_vec<MC_Vector> _node; qs_vec<MC_Facet_Adjacency_Cell> _cellConnectivity; qs_vec<MC_Facet_Geometry_Cell> cell_geometry
      Class MC_Vector {double; double; double}
      Class MC_Facet_Adjacency_Cell: qs_vec<MC_Facet_Adjacency> facet; qs_vec<int> point
        Class MC_Facet_Adjacency: Subfacet_Adjacency subfacet
          Class Subfacet_Adjacency: MC_Subfacet_Adjacency_Event::Enum event; MC_Location current; MC_Location adjacent
            Class MC_Location {int; int; int}
      Class MC_Facet_Geometry_Cell: qs_vec<MC_General_Plane> facet
        Class MC_General_Plane {double; double; double; double}
Managing memory/data: deeply nested data structures
[Figure: structure a {*y, size} and array y[0:N] in CPU memory, with their copies a {*y, size} and y[0:N] in GPU memory]
Managing memory/data: deeply nested data structures
Variant 1: map the structure, then map the array it points to.
#pragma omp target enter data map(to:a[0:1])
#pragma omp target enter data map(to:a->y[0:N])
#pragma omp target
{
  a->y[3] += …
}
#pragma omp target exit data map(release:a->y[0:N])
#pragma omp target exit data map(release:a[0:1])

Variant 2: map the array and the structure separately, then set the pointer on the device.
#pragma omp target data map(to:y[0:N], a[0:1])
{
#pragma omp target
  {
    a->y = y;
  }
#pragma omp target
  {
    a->y[3] += …
  }
}

[Figure: structure a {*y, size} and array y[0:N] in CPU memory, with their copies a {*y, size} and y[0:N] in GPU memory]
Managing memory/data: Unified Addressing
Example: std::vector
template <class T>
struct UMAllocator {
typedef T value_type;
UMAllocator() {}
template <class U> UMAllocator(const UMAllocator<U>& other);
T* allocate(std::size_t n)
{
T* ptr;
#ifdef USE_CUDA_MANAGED
cudaMallocManaged(&ptr, n*sizeof(T));
#else
ptr = (T*) malloc(n*sizeof(T));
#endif
return ptr;
}
void deallocate(T* p, std::size_t n)
{
#ifdef USE_CUDA_MANAGED
cudaFree(p);
#else
free(p);
#endif
}
};
Managing memory/data: deeply nested data structures
std::vector<Real_t, UMAllocator<Real_t> > m_dzz;
…
m_dzz.resize(numNode);
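The allocator on the slide is abbreviated. A minimal host-only sketch of a complete custom allocator for std::vector follows, with the pieces the standard library formally expects (a converting constructor with a body and equality operators). The name HostAllocator and the helper make_field are hypothetical; the malloc call marks the spot where a managed-memory allocation such as cudaMallocManaged would go.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

template <class T>
struct HostAllocator {
    typedef T value_type;

    HostAllocator() noexcept {}
    // Converting constructor, needed when containers rebind the allocator.
    template <class U> HostAllocator(const HostAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        // Placeholder for a unified/managed allocation call.
        void* p = std::malloc(n * sizeof(T));
        if (!p && n != 0) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

// Stateless allocators always compare equal.
template <class T, class U>
bool operator==(const HostAllocator<T>&, const HostAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const HostAllocator<T>&, const HostAllocator<U>&) noexcept { return false; }

// Usage mirroring the slide: a vector whose storage comes from the allocator.
std::vector<double, HostAllocator<double> > make_field(std::size_t numNode) {
    std::vector<double, HostAllocator<double> > field;
    field.resize(numNode);  // value-initializes the elements to 0.0
    return field;
}
```

Keeping the allocation policy inside the allocator means the application code that uses the vector does not change when the memory type changes.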
template <class T>
struct UMAllocator {
typedef T value_type;
UMAllocator() {}
template <class U> UMAllocator(const UMAllocator<U>& other);
T* allocate(std::size_t n)
{
T* ptr;
ptr = (T*) malloc(n*sizeof(T));
#ifdef USE_ATS
cudaMemPrefetchAsync(ptr,n*sizeof(T),0,0); // currently required for performance reasons
cudaDeviceSynchronize();
#endif
return ptr;
}
void deallocate(T* p, std::size_t n)
{
free(p);
}
};
std::vector<Real_t, UMAllocator<Real_t> > m_dzz;
…
m_dzz.resize(numNode);
We expect that in the future prefetching will be handled by the OS, and the CUDA API will not be required.
Managing memory/data: deeply nested data structures
This version will not work on systems that do not support ATS.
Managing memory/data on systems with ATS* (address translation service)
*Part of this presentation includes IBM’s extensions for OpenMP 4.5 and features of OpenMP 5.0 already implemented in compilers supporting OpenMP 4.5
int main(){
int N=20;
double *data = new double[N];
omp_set_default_device(0);
#pragma omp target teams distribute parallel for map(from:data[0:N])
for (int i = 0; i < N; ++i)
data[i] = i*0.1;
for (int i = 0; i < N; i+=4)
printf("data[%d] = %g\n",i,data[i]);
delete [] data;
return 0;
}
Managing memory/data on systems with ATS
int main(){
int N=20;
double *data = new double[N];
omp_set_default_device(0);
#pragma omp target teams distribute parallel for is_device_ptr(data)
for (int i = 0; i < N; ++i)
data[i] = i*0.1;
for (int i = 0; i < N; i+=4)
printf("data[%d] = %g\n",i,data[i]);
delete [] data;
return 0;
}
Systems with ATS enabled
Managing memory/data on systems with ATS
int main(){
int N=20;
double *data = new double[N];
omp_set_default_device(0);
#pragma omp target teams distribute parallel for //is_device_ptr(data)
for (int i = 0; i < N; ++i)
data[i] = i*0.1;
for (int i = 0; i < N; i+=4)
printf("data[%d] = %g\n",i,data[i]);
delete [] data;
return 0;
}
Systems with ATS enabled and
export XLSMPOPTS=TARGETMEM=UIMPLICIT
Managing memory/data on systems with ATS
int main(){
int N=20;
double *data, *data2;
omp_set_default_device(0);
data = new double[2];
data2 = new double[N];
#pragma omp target teams distribute parallel for map(from:data2[0:N])//is_device_ptr(data)
for (int i = 0; i < N; ++i)
data2[i] = data[i%2] + i*0.1;
delete [] data;
delete [] data2;
return 0;
}
Systems with ATS enabled and
export XLSMPOPTS=TARGETMEM=UIMPLICIT
Managing memory/data on systems with ATS
Here “map” is not ignored: memory for data2 is allocated on the device, and the content of data2 is copied from device to host.
subroutine foo(a,n)
real*8, dimension(n) :: a
!$omp target teams distribute parallel do
do i=1,N ; a(i)=0.1*i; end do
end
program test_implicit_ats
integer, parameter :: N=20
real*8, dimension(:), allocatable :: data
allocate(data(N))
call foo(data,N)
print *,data(10)
end
nvprof ./a.out
1.9520us 160B 78.170MB/s Pinned Device Tesla V100-SXM2 [CUDA memcpy HtoD]
1.6000us - Tesla V100-SXM2 __xl_foo_l3_OL_1 [156]
2.0480us 160B 74.506MB/s Device Pinned Tesla V100-SXM2 [CUDA memcpy DtoH]
export XLSMPOPTS=TARGETMEM=UIMPLICIT
nvprof ./a.out
7.1040us - - - Tesla V100-SXM2 __xl_foo_l3_OL_1 [150]
Managing memory/data on systems with ATS
Fortran
Contributed by Lixiang Luo, IBM Research
LULESH: performance with ATS and CUDA Managed Memory

ATS allocation:
#ifdef USE_ATS
  ptr = (T*) malloc(n*sizeof(T));
  cudaMemPrefetchAsync(ptr,n*sizeof(T),0,0);
  cudaDeviceSynchronize();
#else …

CUDA Managed allocation:
#ifdef USE_CUDA_MANAGED
  cudaMallocManaged(&ptr, n*sizeof(T));
  cudaMemPrefetchAsync(ptr,n*sizeof(T),0,0);
  cudaDeviceSynchronize();
#else …

Simulations with 1 MPI rank/GPU
# MPI ranks   # nodes   FOM/node: CUDA Managed   FOM: ATS
1000          166.7     312,782                  327,581
1728          288       308,760                  328,513

Simulations with 2 MPI ranks/GPU [+MPS]
# MPI ranks   # nodes   FOM/node: CUDA Managed   FOM: ATS
1000          83.3      332,393                  358,852
1728          144       331,248                  358,538
OpenMP: Nested Parallel regions on CPUs and GPUs
Nested parallelism + concurrent execution on all devices
void initialize_x_and_y(double *x, double *y, int N, int offset, bool USE_DEVICE);

int main(){
  double *x, *y;
  double DEVICE_FRACTION=0;
  int num_devices, i, chunk, j_start, N = 1024*1024*10;
  bool USE_DEVICE;
  x = new double[N]; y = new double[N];
  //enable nested parallelism
  omp_set_nested(1);
  //get number of devices
  num_devices = omp_get_num_devices();
  if (num_devices>0) DEVICE_FRACTION=0.9;
  //one host thread per device, plus one thread for the CPU share
  #pragma omp parallel for num_threads(num_devices+1) private(chunk, j_start, USE_DEVICE)
  for (i = 0; i < (num_devices+1); ++i){
    if (i < num_devices){
      omp_set_default_device(i);
      chunk = DEVICE_FRACTION * N / num_devices;
      j_start = chunk*i;
      USE_DEVICE=true;
    } else {
      chunk = N; //default
      j_start = 0; //default
      USE_DEVICE=false; //default
      if (num_devices > 0){
        j_start = (DEVICE_FRACTION * N / num_devices) * num_devices;
        chunk = N - j_start;
      }
    }
    initialize_x_and_y(x+j_start, y+j_start, chunk, j_start, USE_DEVICE);
  }
  delete [] x; delete [] y;
  return 0;
}

void initialize_x_and_y(double *x, double *y, int N, int offset, bool USE_DEVICE){
  #pragma omp target teams distribute parallel for map(from:x[0:N], y[0:N]) if(target:USE_DEVICE)
  for (int i=0; i < N; ++i){
    x[i] = (offset + i) * 0.001;
    y[i] = (offset + i) * 0.003;
  }
}
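The work split in the example above can be isolated in a small helper, which makes it easier to check that the device chunks and the host chunk cover the whole index range exactly once. This is a sketch under the same scheme (equal chunks per device, remainder on the host); the Chunk struct and partition function are hypothetical names, not part of the original code:

```cpp
#include <cassert>

// One worker's share of the index range [0, n).
struct Chunk {
    int start;        // first index owned by this worker
    int count;        // number of indices owned by this worker
    bool use_device;  // true for device workers, false for the host worker
};

// Workers 0..num_devices-1 drive one device each; worker num_devices
// is the host and takes whatever the devices do not cover.
Chunk partition(int worker, int num_devices, int n, double device_fraction) {
    assert(num_devices >= 0 && worker >= 0 && worker <= num_devices);
    if (num_devices == 0)
        return Chunk{0, n, false};  // no devices: the host does everything

    // Truncate once, so device chunks and the host start stay consistent.
    const int per_device = static_cast<int>(device_fraction * n / num_devices);
    if (worker < num_devices)
        return Chunk{worker * per_device, per_device, true};

    const int host_start = per_device * num_devices;
    return Chunk{host_start, n - host_start, false};
}
```

Computing the per-device chunk once as an integer avoids a mismatch between the device ranges and the host start when the floating-point product is not a whole number.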
Nested parallelism: communication in LULESH
#pragma omp parallel sections private(pmsg,emsg,cmsg,destAddr)
{
  #pragma omp section
  {
    if (planeMin | planeMax) {
      …
      destAddr = &domain.commDataSend[pmsg * maxPlaneComm] ;
      #pragma omp target teams distribute parallel for collapse(2) if(target:USE_DEVICE) is_device_ptr(destAddr) thread_limit(64)
      for (Index_t fi=0 ; fi<xferFields; ++fi) {
        for (Index_t i=0; i<sendCount; ++i) {
          destAddr[i+sendCount*fi] = ptr_fi[fi][i] ;
        }
      }
      MPI_Isend(destAddr, …) ;
  }
  #pragma omp section
  {
    if (rowMin && planeMin && not_planeOnly) {
      …
      destAddr = &domain.commDataSend[pmsg * maxPlaneComm + emsg * maxEdgeComm] ;
      #pragma omp target teams distribute parallel for collapse(2) if(target:USE_DEVICE) is_device_ptr(destAddr) thread_limit(64)
      for (Index_t fi=0; fi<xferFields; ++fi) {
        for (Index_t i=0; i<dx; ++i) {
          destAddr[i + dx*fi] = ptr_fi[fi][i] ;
        }
      }
      MPI_Isend(destAddr, …) ;
    }
  }
  …..
Nested parallelism: communication in LULESH
#pragma omp parallel num_threads(2)
{
  if (omp_get_thread_num() == 0){
    /* evaluate time constraint */
    CalcCourantConstraintForElems(domain, domain.regElemSize(r), domain.regElemlist(r),
                                  domain.qqc(), domain.dtcourant()) ;
  }
  if (omp_get_thread_num() == (omp_get_num_threads() - 1) ){
    /* check hydro constraint */
    CalcHydroConstraintForElems(domain, domain.regElemSize(r), domain.regElemlist(r),
                                domain.dvovmax(), domain.dthydro()) ;
  }
}

Each of the two functions called above contains:
#pragma omp target teams distribute parallel for \
  if(target:USE_DEVICE) map(tofrom:pos) map(from:…)
Asynchronous execution
Asynchronous execution in LULESH
void CalcEnergyForElems( …..){
  …
  #pragma omp target teams distribute parallel for is_device_ptr(compHalfStep,delvc, …q_old) if(target:USE_DEVICE)
  for (Index_t i = 0 ; i < length ; ++i) {
    Real_t vhalf = Real_t(1.) / (Real_t(1.) + compHalfStep[i]) ;
    ……..
  }

  #pragma omp target teams distribute parallel for is_device_ptr(e_new,work) if(target:USE_DEVICE)
  for (Index_t i = 0 ; i < length ; ++i) {
    e_new[i] += Real_t(0.5) * work[i];
    if (FABS(e_new[i]) < e_cut) e_new[i] = Real_t(0.) ;
    if ( e_new[i] < emin ) e_new[i] = emin ;
  }

  CalcPressureForElems(p_new, bvc, pbvc, e_new, compression, vnewc, pmin, p_cut, eosvmax, length, regElemList);

  #pragma omp target teams distribute parallel for is_device_ptr(delvc, … ,regElemList) if(target:USE_DEVICE)
  for (Index_t i = 0 ; i < length ; ++i){
    const Real_t sixth = Real_t(1.0) / Real_t(6.0) ;
    ….
  }

void CalcPressureForElems(Real_t* p_new, …. )
  #pragma omp target teams … if(target:USE_DEVICE)
  for (Index_t i = 0; i < length ; ++i) {
    Real_t c1s = Real_t(2.0)/Real_t(3.0) ;
    bvc[i] = c1s * (compression[i] + Real_t(1.));
    pbvc[i] = c1s;
  }

  #pragma omp target … if(target:USE_DEVICE)
  for (Index_t i = 0 ; i < length ; ++i){
    Index_t elem = regElemList[i];
    …
  }
void CalcEnergyForElems( …..){
  …
  #pragma omp target teams distribute parallel for is_device_ptr(compHalfStep,delvc, …q_old) nowait depend(inout:dep_flag) if(target:USE_DEVICE)
  for (Index_t i = 0 ; i < length ; ++i) {
    Real_t vhalf = Real_t(1.) / (Real_t(1.) + compHalfStep[i]) ;
    ……..
  }

  #pragma omp target teams distribute parallel for is_device_ptr(e_new,work) nowait depend(inout:dep_flag) if(target:USE_DEVICE)
  for (Index_t i = 0 ; i < length ; ++i) {
    e_new[i] += Real_t(0.5) * work[i];
    if (FABS(e_new[i]) < e_cut) e_new[i] = Real_t(0.) ;
    if ( e_new[i] < emin ) e_new[i] = emin ;
  }

  CalcPressureForElems(p_new, bvc, pbvc, e_new, compression, vnewc, pmin, p_cut, eosvmax, length, regElemList, dep_flag);

  #pragma omp target teams distribute parallel for is_device_ptr(delvc, … ,regElemList) nowait depend(inout:dep_flag) if(target:USE_DEVICE)
  for (Index_t i = 0 ; i < length ; ++i){
    const Real_t sixth = Real_t(1.0) / Real_t(6.0) ;
    ….
  }

void CalcPressureForElems(Real_t* p_new, …. int dep_flag)
  #pragma omp target teams … nowait depend(inout:dep_flag) if(target:USE_DEVICE)
  for (Index_t i = 0; i < length ; ++i) {
    Real_t c1s = Real_t(2.0)/Real_t(3.0) ;
    bvc[i] = c1s * (compression[i] + Real_t(1.));
    pbvc[i] = c1s;
  }

  #pragma omp target … nowait depend(inout:dep_flag) if(target:USE_DEVICE)
  for (Index_t i = 0 ; i < length ; ++i){
    Index_t elem = regElemList[i];
    …
  }
Asynchronous execution in LULESH
[Figure: LULESH (PWR8 + Pascal); FOM (z/s) vs. # of nodes]
# of nodes   Synch   Asynch
1            163K    203K
8            158K    197K
27           158K    196K
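The nowait/depend pattern can be reduced to a small self-contained sketch: two target regions are launched asynchronously, ordered through a dependence on one token variable, and the host waits once at the end. The function name and data are hypothetical. Note that depend matches by storage location, so in this sketch both regions refer to the same variable in the same scope.

```cpp
// Adds half of work[i] to e[i], then clamps negative values to zero.
// Both kernels are deferred (nowait) and ordered only by dep_flag.
void async_energy_update(double* e, const double* work, int n) {
    int dep_flag = 0;  // dependence token; its value is never used

    #pragma omp target teams distribute parallel for nowait \
        depend(inout: dep_flag) map(tofrom: e[0:n]) map(to: work[0:n])
    for (int i = 0; i < n; ++i)
        e[i] += 0.5 * work[i];

    // Starts only after the first region has completed, because both
    // regions declare an inout dependence on the same variable.
    #pragma omp target teams distribute parallel for nowait \
        depend(inout: dep_flag) map(tofrom: e[0:n])
    for (int i = 0; i < n; ++i)
        if (e[i] < 0.0) e[i] = 0.0;

    // The host must wait before it reads e.
    #pragma omp taskwait
}
```

Between the launches and the taskwait the host thread is free to do other work, such as posting MPI communication.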
[implicit] Placing Data in GPU’s Shared Memory
BLK_SZ is known at compile time.
VAL is team private.
Performance: achieved BW is measured as (Nr*Nc*2*8 bytes)/(kernel time). With BLK_SZ=32 we measure ~900 GB/s while using shared memory and ~40 GB/s without …
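The kernel for this slide did not survive extraction, so the following is only a sketch of the general idea, not the original code. It assumes a blocked matrix transpose (one 8-byte read and one 8-byte write per element, matching the factor 2*8 in the bandwidth formula) and declares a fixed-size array VAL inside the teams region so that it is private to each team. Whether a compiler places such an array in GPU shared memory is implementation dependent. The function names are hypothetical.

```cpp
#include <cstddef>

constexpr int BLK_SZ = 32;  // block size known at compile time

// Achieved bandwidth in GB/s: (Nr*Nc*2*8 bytes) / (kernel time).
double achieved_bw_gbs(std::size_t nr, std::size_t nc, double kernel_seconds) {
    return (static_cast<double>(nr) * nc * 2.0 * 8.0) / kernel_seconds / 1.0e9;
}

// Transposes the nr x nc row-major matrix `in` into the nc x nr matrix `out`.
// Assumes nr and nc are multiples of BLK_SZ.
void blocked_transpose(const double* in, double* out, int nr, int nc) {
    const int br = nr / BLK_SZ;
    const int bc = nc / BLK_SZ;
    #pragma omp target teams distribute collapse(2) \
        map(to: in[0:nr*nc]) map(from: out[0:nr*nc])
    for (int bi = 0; bi < br; ++bi) {
        for (int bj = 0; bj < bc; ++bj) {
            double VAL[BLK_SZ][BLK_SZ];  // team private, shared by the team's threads

            // Stage one block of the input in the team-private tile.
            #pragma omp parallel for collapse(2)
            for (int i = 0; i < BLK_SZ; ++i)
                for (int j = 0; j < BLK_SZ; ++j)
                    VAL[i][j] = in[(bi * BLK_SZ + i) * nc + (bj * BLK_SZ + j)];

            // Write the tile back transposed, with contiguous writes.
            #pragma omp parallel for collapse(2)
            for (int i = 0; i < BLK_SZ; ++i)
                for (int j = 0; j < BLK_SZ; ++j)
                    out[(bj * BLK_SZ + i) * nr + (bi * BLK_SZ + j)] = VAL[j][i];
        }
    }
}
```

Staging through the tile lets both the reads and the writes to global memory be contiguous, which is the usual motivation for using shared memory in this kind of kernel.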
Acknowledgement:
IBM Compiler and OpenMP-runtime team:
Ettore Tiotto, Tarique Islam, Bardia Mahjour, Zarko Todorovski, Wael Yehia, Rafik Zurob, Wang Chen, Kelvin Li,
Alexandre Eichenberger, George Bercea, Kevin O’Brien
LLNL’s personnel : Riyaz Haque, Tom Scogland