Lightweight threading and vectorization with OpenMP in ACME
Azamat Mametjanov, Robert Jacob, Mark Taylor

For additional information, contact:
Azamat Mametjanov, Argonne National Laboratory
(630) 252-1074, [email protected]
climatemodeling.science.energy.gov/acme

Objective

Next-generation DOE machines, Cori II (October 2016) and Aurora (2018), are expected to provide 3-5x more cores per node with the KNL and KNH Xeon Phi many-core chips. Applications must increase their fine-grained threading and vectorization to fully occupy ~60 cores per node, each with 8-double-wide vector units. While the new on-package high-bandwidth memory is expected to provide a 5x speedup to existing executables, significant source-code refactoring and fine-grained parallelization are needed to achieve better efficiency. Pure MPI parallelization can lead to degradation from network congestion and to out-of-memory errors at high numbers of tasks per node. Our goal is to leverage the shared-memory parallelism of OpenMP to efficiently utilize the available hardware threads and vector units.

Primary methods toward this goal (a toy sketch follows below):
1. Parallelize sequential loops
2. Collapse nested loops
3. Vectorize innermost loops

Complicating factors are threading and SIMD overheads, bit-for-bit correctness, and portability of speedups and correctness across the supported compilers: Intel, IBM, PGI, and GNU.
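As a toy illustration of these three methods (not ACME code; the array names and shapes below are invented for the example), a generic loop nest can be parallelized and collapsed over q and k while the innermost i loop is vectorized:

  subroutine toy_update(a, b, c, np, nlev, qsize)
     implicit none
     integer, intent(in)    :: np, nlev, qsize
     real,    intent(in)    :: b(np, nlev), c(qsize)
     real,    intent(inout) :: a(np, nlev, qsize)
     integer :: i, k, q
     !$omp parallel do collapse(2) private(q, k, i)   ! methods 1 and 2
     do q = 1, qsize
        do k = 1, nlev
           !$omp simd                                 ! method 3
           do i = 1, np
              a(i, k, q) = a(i, k, q) + b(i, k) * c(q)
           end do
        end do
     end do
  end subroutine toy_update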

Approach

Profile to pick compute-bound regions
• CrayPAT, HPCToolkit, Intel VTune/Advisor/Inspector, …
• No “hot” regions: many “warm” ones

OpenMP updates (in diff format):

  call t_startf('omp_par_do_region')                                      ! add a timer
  !$omp parallel do num_threads(hthreads), default(shared), private(ie)   ! parallelize
  do ie = nets, nete
     call omp_set_num_threads(vthreads)                                   ! nested threading
     !$omp parallel do private(q,k) collapse(2)                           ! collapse nested loops
     do q = 1, qsize                                                      ! reorder leading dim
        do k = 1, nlev                                                    !   into inner loop
           !$omp simd                                                     ! vectorize
           gradQ(:, :, 1) = Vstar(:, :, 1, k) * elem(ie)%state%Qdp(:, :, k, q, ie)
           ! ... remainder of the loop body ...
        end do
     end do
  end do
  call t_stopf('omp_par_do_region')

Benchmark: based on env_mach_pes.xml, improve timing_stats (see the layout sketch below)
• Also, inspect compiler-generated listings and optimization reports
Test: (PET, PEM, PMT, ERP tests) on (F-, I-, C-, G-, B-cases) without cprnc diffs
Port: re-bench and re-test on other compilers
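For reference, a minimal sketch of the env_mach_pes.xml entries that set a component's MPI x OpenMP layout, assuming the flat-entry format used by ACME's scripts at the time (the exact schema varies by version); the ids are the standard NTASKS_*/NTHRDS_*/ROOTPE_* variables, but the values are illustrative only, not the layouts benchmarked here:

  <entry id="NTASKS_ATM" value="43200" />   <!-- total MPI ranks for the atmosphere -->
  <entry id="NTHRDS_ATM" value="8" />       <!-- OpenMP threads per rank -->
  <entry id="ROOTPE_ATM" value="0" />       <!-- first PE assigned to the atmosphere -->

Sweeping these entries (together with the batch job's ranks per node) generates the Nodes x Ranks x Threads points compared in the charts below, and the timing_stats summary reports the resulting throughput in simulated years per day (SYPD).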

Impact

Enabled nested threading in ACME (see the sketch at the end of this section)
• Nested threading shows speedups: e.g., 1.71x speedup in atm/lnd coupled F-cases
• Beyond scaling limits, nested threads occupy “unused” cores: e.g., ne120 on an 8K-node Mira partition

Improving OpenMP run-time support
• BOLT (BOLT is OpenMP over Lightweight Threads) provides a 1.6x speedup to nested threading
• Intel is using the transport_se mini-app to improve libraries and compiler support

Early access to KNL nodes
• 64 cores and 256 hardware threads with 16 GB of high-bandwidth memory
• Nested threads are able to use all 256 hardware threads while fitting into the 16 GB HBM
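A self-contained toy sketch of the nested-threading idea (not ACME's implementation): an outer “horizontal” team over elements and inner “vertical” teams over levels, sized so that hthreads x vthreads covers the hardware threads available to a rank, e.g. 4 x 16 with 4 ranks per 64-core/256-thread KNL node. The loop bounds and team sizes below are invented for the example:

  program nested_sketch
     use omp_lib
     implicit none
     integer :: hthreads, vthreads, ie, k
     real    :: work(72, 16)
     call omp_set_nested(.true.)                           ! allow a second level of parallelism
     hthreads = 4                                          ! outer team: elements handled concurrently
     vthreads = max(1, omp_get_max_threads() / hthreads)   ! inner team: threads left per outer thread
     !$omp parallel do num_threads(hthreads) private(ie, k)
     do ie = 1, 16                                         ! "horizontal" loop over elements
        !$omp parallel do num_threads(vthreads) private(k)
        do k = 1, 72                                       ! "vertical" loop over levels
           work(k, ie) = real(k * ie)                      ! placeholder per-level work
        end do
     end do
     print *, 'checksum = ', sum(work)
  end program nested_sketch

In a real run the team sizes would come from the model's PE layout rather than constants, and the runtime must permit nesting (e.g. OMP_NESTED=true or OMP_MAX_ACTIVE_LEVELS=2).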

[Figure: Node subscription and nested threading. SYPD for compset FC5AV1C-01 at ne120_ne120 resolution on Mira, plotted against Nodes x Ranks x Horiz threads (x Vert threads); SYPD ranges from 0.133 to 0.432 across the configurations shown, with annotations highlighting a 1.71x speedup and a 1.22x speedup from nested threading.]

[Figure: Hybrid MPI+OpenMP optimization and scaling. SYPD for compset FC5AV1C-01 at ne120_ne120 resolution on Mira, plotted against Nodes x Ranks x Threads for 1350, 2700, and 5400 nodes (64, 32, and 16 elements/node); SYPD ranges from 0.341 to 0.767 across the configurations shown.]