Lightweight threading and vectorization with OpenMP in ACME
Azamat Mametjanov, Robert Jacob, Mark Taylor

For additional information, contact:
Azamat Mametjanov, Argonne National Laboratory
(630) [email protected]
climatemodeling.science.energy.gov/acme
Objective

Next-generation DOE machines Cori II (October 2016) and Aurora (2018) are expected to provide 3-5x more cores per node with KNL and KNH Xeon Phi many-core chips. Applications are required to increase their fine-grained threading and vectorization potential to fully occupy ~60 cores, each with 8-double-wide vector units. While the new on-package high-bandwidth memory is expected to provide a 5x speedup to existing executables, significant source-code refactoring and fine-grained parallelization is needed to achieve better efficiency. Pure MPI parallelization can lead to network congestion and out-of-memory errors at high numbers of tasks per node. Our goal is to leverage the shared-memory parallelism of OpenMP to efficiently utilize the available hardware threads and vector units.

Primary methods toward this goal:
1. Parallelize sequential loops
2. Collapse nested loops
3. Vectorize innermost loops

Complicating factors are threading and SIMD overheads, bit-for-bit correctness, and portability of speedups and correctness across supported compilers: Intel, IBM, PGI, GNU.
Approach

Profile to pick compute-bound regions
• CrayPAT, HPCToolkit, Intel VTune/Advisor/Inspector, …
• No "hot" regions: many "warm" ones

OpenMP updates (in diff format):

call t_startf('omp_par_do_region')                                    ! add a timer
!$omp parallel do num_threads(hthreads), default(shared), private(ie) ! parallelize
do ie = nets, nete
  call omp_set_num_threads(vthreads)                                  ! nested threading
  !$omp parallel do private(q,k) collapse(2)                          ! collapse nested loops
  do q = 1, qsize                                                     ! reorder leading dim
    do k = 1, nlev                                                    ! into inner loop
      !$omp simd                                                      ! vectorize
      gradQ(:, :, 1) = Vstar(:, :, 1, k) * elem(ie)%state%Qdp(:, :, k, q, ie)
    end do
  end do
end do
call t_stopf('omp_par_do_region')

Benchmark: based on env_mach_pes.xml, improve timing_stats
• Also, inspect compiler-generated listings and optimization reports
Test: (PET, PEM, PMT, ERP tests) on (F-, I-, C-, G-, B-cases) without cprnc diffs
Port: re-bench and re-test on other compilers
Impact

Enabled nested threading in ACME
• Nested threading shows speedups: e.g., a 1.71x speedup in atm/lnd coupled F-cases.
• Beyond scaling limits, nested threads occupy "unused" cores: e.g., ne120 on an 8K Mira partition.

Improving OpenMP run-time support
• BOLT (BOLT is OpenMP over Lightweight Threads) provides a 1.6x speedup to nested threading.
• Intel is using the transport_se mini-app to improve libraries and compiler support.

Early access to KNL nodes
• 64 cores and 256 hardware threads with 16 GB of high-bandwidth memory.
• Nested threads are able to use all 256 hardware threads while fitting into the 16 GB HBM.
[Figure: "Node subscription and nested threading" — SYPD vs. Nodes x Ranks x Horiz threads (x Vert threads) for compset FC5AV1C-01 at ne120_ne120 resolution on Mira; annotations mark 1.71x and 1.22x speedups from nested threading.]

[Figure: "Hybrid MPI+OpenMP optimization and scaling" — SYPD vs. Nodes x Ranks x Threads for compset FC5AV1C-01 at ne120_ne120 resolution on Mira; 1350, 2700, and 5400 nodes (64, 32, and 16 elements/node) with 1x64, 2x32, 4x16, and 8x8 rank x thread splits per node.]