Lightweight threading and vectorization with OpenMP in ACME
Azamat Mametjanov, Robert Jacob, Mark Taylor

For additional information, contact:
Azamat Mametjanov, Argonne National Laboratory
(630) 252-1074, [email protected]
climatemodeling.science.energy.gov/acme

Objective

Next-generation DOE machines, Cori II (October 2016) and Aurora (2018), are expected to provide 3-5x more cores per node with the KNL and KNH Xeon Phi many-core chips. Applications must increase their fine-grained threading and vectorization to fully occupy ~60 cores per node, each with 8-double-wide vector units. While the new on-package high-bandwidth memory is expected to provide a 5x speedup to existing executables, significant source-code refactoring and fine-grained parallelization are needed to achieve better efficiency. Pure MPI parallelization can lead to degradation from network congestion and to out-of-memory errors at high numbers of tasks per node. Our goal is to leverage the shared-memory parallelism of OpenMP to efficiently utilize the available hardware threads and vector units.

Primary methods toward this goal (a toy sketch follows below):
1. Parallelize sequential loops
2. Collapse nested loops
3. Vectorize innermost loops

Complicating factors are threading and SIMD overheads, bit-for-bit correctness, and portability of speedups and correctness across the supported compilers: Intel, IBM, PGI, and GNU.
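As a toy illustration of these three methods (not ACME code; the array names and shapes below are invented for the example), a generic loop nest can be parallelized and collapsed over q and k while the innermost i loop is vectorized:

  subroutine toy_update(a, b, c, np, nlev, qsize)
     implicit none
     integer, intent(in)    :: np, nlev, qsize
     real,    intent(in)    :: b(np, nlev), c(qsize)
     real,    intent(inout) :: a(np, nlev, qsize)
     integer :: i, k, q
     !$omp parallel do collapse(2) private(q, k, i)   ! methods 1 and 2
     do q = 1, qsize
        do k = 1, nlev
           !$omp simd                                 ! method 3
           do i = 1, np
              a(i, k, q) = a(i, k, q) + b(i, k) * c(q)
           end do
        end do
     end do
  end subroutine toy_update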

Approach

Profile to pick compute-bound regions
• CrayPAT, HPCToolkit, Intel VTune/Advisor/Inspector, …
• No “hot” regions: many “warm” ones

OpenMP updates (in diff format):

  call t_startf('omp_par_do_region')                                      ! add a timer
  !$omp parallel do num_threads(hthreads), default(shared), private(ie)   ! parallelize
  do ie = nets, nete
     call omp_set_num_threads(vthreads)                                   ! nested threading
     !$omp parallel do private(q,k) collapse(2)                           ! collapse nested loops
     do q = 1, qsize                                                      ! reorder leading dim
        do k = 1, nlev                                                    !   into inner loop
           !$omp simd                                                     ! vectorize
           gradQ(:, :, 1) = Vstar(:, :, 1, k) * elem(ie)%state%Qdp(:, :, k, q, ie)
           ! ... remainder of the loop body ...
        end do
     end do
  end do
  call t_stopf('omp_par_do_region')

Benchmark: based on env_mach_pes.xml, improve timing_stats (see the layout sketch below)
• Also, inspect compiler-generated listings and optimization reports
Test: (PET, PEM, PMT, ERP tests) on (F-, I-, C-, G-, B-cases) without cprnc diffs
Port: re-bench and re-test on other compilers
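For reference, a minimal sketch of the env_mach_pes.xml entries that set a component's MPI x OpenMP layout, assuming the flat-entry format used by ACME's scripts at the time (the exact schema varies by version); the ids are the standard NTASKS_*/NTHRDS_*/ROOTPE_* variables, but the values are illustrative only, not the layouts benchmarked here:

  <entry id="NTASKS_ATM" value="43200" />   <!-- total MPI ranks for the atmosphere -->
  <entry id="NTHRDS_ATM" value="8" />       <!-- OpenMP threads per rank -->
  <entry id="ROOTPE_ATM" value="0" />       <!-- first PE assigned to the atmosphere -->

Sweeping these entries (together with the batch job's ranks per node) generates the Nodes x Ranks x Threads points compared in the charts below, and the timing_stats summary reports the resulting throughput in simulated years per day (SYPD).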

Impact

Enabled nested threading in ACME (see the sketch at the end of this section)
• Nested threading shows speedups: e.g., 1.71x speedup in atm/lnd coupled F-cases
• Beyond scaling limits, nested threads occupy “unused” cores: e.g., ne120 on an 8K-node Mira partition

Improving OpenMP run-time support
• BOLT (BOLT is OpenMP over Lightweight Threads) provides a 1.6x speedup to nested threading
• Intel is using the transport_se mini-app to improve libraries and compiler support

Early access to KNL nodes
• 64 cores and 256 hardware threads with 16 GB of high-bandwidth memory
• Nested threads are able to use all 256 hardware threads while fitting into the 16 GB HBM
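A self-contained toy sketch of the nested-threading idea (not ACME's implementation): an outer “horizontal” team over elements and inner “vertical” teams over levels, sized so that hthreads x vthreads covers the hardware threads available to a rank, e.g. 4 x 16 with 4 ranks per 64-core/256-thread KNL node. The loop bounds and team sizes below are invented for the example:

  program nested_sketch
     use omp_lib
     implicit none
     integer :: hthreads, vthreads, ie, k
     real    :: work(72, 16)
     call omp_set_nested(.true.)                           ! allow a second level of parallelism
     hthreads = 4                                          ! outer team: elements handled concurrently
     vthreads = max(1, omp_get_max_threads() / hthreads)   ! inner team: threads left per outer thread
     !$omp parallel do num_threads(hthreads) private(ie, k)
     do ie = 1, 16                                         ! "horizontal" loop over elements
        !$omp parallel do num_threads(vthreads) private(k)
        do k = 1, 72                                       ! "vertical" loop over levels
           work(k, ie) = real(k * ie)                      ! placeholder per-level work
        end do
     end do
     print *, 'checksum = ', sum(work)
  end program nested_sketch

In a real run the team sizes would come from the model's PE layout rather than constants, and the runtime must permit nesting (e.g. OMP_NESTED=true or OMP_MAX_ACTIVE_LEVELS=2).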

[Figure: Node subscription and nested threading. SYPD for compset FC5AV1C-01 at ne120_ne120 resolution on Mira, plotted against Nodes x Ranks x Horiz threads (x Vert threads); SYPD ranges from 0.133 to 0.432 across the configurations shown, with annotations highlighting a 1.71x speedup and a 1.22x speedup from nested threading.]

[Figure: Hybrid MPI+OpenMP optimization and scaling. SYPD for compset FC5AV1C-01 at ne120_ne120 resolution on Mira, plotted against Nodes x Ranks x Threads for 1350, 2700, and 5400 nodes (64, 32, and 16 elements/node); SYPD ranges from 0.341 to 0.767 across the configurations shown.]