
On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Page 1: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon

SAN DIEGO SUPERCOMPUTER CENTER

NATIONAL PARTNERSHIP FOR ADVANCED COMPUTATIONAL INFRASTRUCTURE

On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon

D. Pekurovsky, L. Nett-Carrington, D. Holland, T. Kaiser

San Diego Supercomputer Center

Page 2: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Overview

• Blue Horizon Hardware
• Motivation for this work
• Two methods of hybrid programming
• Fine grain results
• A word on coarse grain techniques
• Coarse grain results
• Time variability
• Effects of thread binding
• Final Conclusions

Page 3: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Blue Horizon Hardware

• 144 IBM SP High Nodes
• Each node:
  • 8-way SMP
  • 4 GB memory
  • crossbar
• Each processor:
  • Power3, 222 MHz
  • 4 Flop/cycle
• Aggregate peak: 1.002 Tflop/s
• Compilers:
  • IBM mpxlf_r, version 7.0.1
  • KAI guidef90, version 3.9

Page 4: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Blue Horizon Hardware

• Interconnect (between nodes):
  • Currently:
    • 115 MB/s
    • 4 MPI tasks/node: must use OpenMP to utilize all processors
  • Soon:
    • 500 MB/s
    • 8 MPI tasks/node: can use OpenMP to supplement MPI (if it is worth it)

Page 5: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Hybrid Programming: why use it?

• Non-performance-related reasons:
  • Avoid replication of data on the node
• Performance-related reasons:
  • Avoid latency of MPI on the node
  • Avoid unnecessary data copies inside the node
  • Reduce latency of MPI calls between the nodes
  • Decrease global MPI operations (reduction, all-to-all)
• The price to pay:
  • OpenMP overheads
  • False sharing

Is it really worth trying?

Page 6: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Hybrid Programming

• Two methods of combining MPI and OpenMP in parallel programs

Fine grain:

  program main
! MPI initialization
  ....
! CPU-intensive loop
!$OMP PARALLEL DO
  do i = 1, n
     ! work
  end do
  ....
  end

Coarse grain:

  program main
! MPI initialization
!$OMP PARALLEL
  ....
  do i = 1, n
     ! work
  end do
  ....
!$OMP END PARALLEL
  end
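
To make the fine grain skeleton concrete, here is a minimal self-contained sketch (not taken from the NPB sources; the array x, its size n, and the squared-sum reduction are hypothetical placeholders):

  program fine_sketch
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 10000
    integer :: i, ierr, myid
    real(8) :: x(n), local_sum, global_sum

    call MPI_Init(ierr)                 ! MPI initialization, as in the pure MPI code
    call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)

    ! CPU-intensive loop: threads exist only for the duration of this loop
    local_sum = 0.0d0
!$OMP PARALLEL DO REDUCTION(+:local_sum)
    do i = 1, n
       x(i) = dble(i + myid)            ! placeholder work
       local_sum = local_sum + x(i)*x(i)
    end do
!$OMP END PARALLEL DO

    ! MPI calls stay outside of any OpenMP region
    call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                       MPI_SUM, MPI_COMM_WORLD, ierr)
    if (myid == 0) print *, 'global sum =', global_sum

    call MPI_Finalize(ierr)
  end program fine_sketch

Only the loop is threaded, so every such OMP PARALLEL DO pays the thread fork/join overhead again; this is the cost discussed on the next slide.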

Page 7: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Hybrid programming

Fine grain approach
• Easy to implement
• Performance: low, due to the overhead of OpenMP directives (OMP PARALLEL DO)

Coarse grain approach
• Time-consuming implementation
• Performance: less overhead for thread creation

Page 8: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Hybrid NPB using fine grain parallelism

• CG, MG, and FT suites of NAS Parallel Benchmarks (NPB).

  Suite name                 # loops parallelized
  CG - Conjugate Gradient    18
  MG - Multi-Grid            50
  FT - Fourier Transform      8

• Results shown are the best of 5-10 runs
• Complete results at http://www.sdsc.edu/SciComp/PAA/Benchmarks/Hybrid/hybrid.html

Page 9: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Fine grain results - CG (A&B class)

Page 10: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Fine grain results - MG (A&B class)

Page 11: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Fine grain results - MG (C class)

Page 12: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Fine grain results - FT (A&B class)

Page 13: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Fine grain results - FT (C class)

Page 14: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Hybrid NPB using coarse grain parallelism: MG suite

Overview of the method

[Diagram: four MPI tasks (Task 1 - Task 4); within a task, the work is divided among OpenMP threads (Thread 1, Thread 2)]

Page 15: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Coarse grain programming methodology

• Start with MPI code
• Each MPI task spawns threads once, in the beginning
• Serial work (initialization etc.) and MPI calls are done inside a MASTER or SINGLE region
• Main arrays are global
• Work distribution: each thread gets a chunk of the array based on its thread number (omp_get_thread_num()); in this work, one-dimensional blocking is used (see the sketch below)
• Avoid using OMP DO
• Be careful with scoping and synchronization
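
A minimal self-contained sketch of this pattern (not taken from the NPB sources; the array u, its size n, and the final reduction are hypothetical placeholders, and it assumes the MPI library allows MPI calls from the master thread inside the parallel region):

  program coarse_sketch
    use omp_lib
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 1024
    real(8) :: u(n)                     ! main array is global (shared)
    real(8) :: node_sum, global_sum
    integer :: ierr, myid, tid, nth, lo, hi, chunk

    call MPI_Init(ierr)                 ! start from the MPI code
    call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)

!$OMP PARALLEL PRIVATE(tid, nth, lo, hi, chunk)
    ! threads are spawned once, near the beginning
    tid   = omp_get_thread_num()
    nth   = omp_get_num_threads()

    ! work distribution by hand (no OMP DO): one-dimensional blocking
    chunk = n / nth
    lo    = tid*chunk + 1
    hi    = lo + chunk - 1
    if (tid == nth-1) hi = n            ! last thread takes the remainder

    u(lo:hi) = dble(myid + tid)         ! placeholder for the real per-thread work

!$OMP BARRIER
!$OMP MASTER
    ! serial work and MPI calls are confined to the master thread
    node_sum = sum(u)
    call MPI_Allreduce(node_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                       MPI_SUM, MPI_COMM_WORLD, ierr)
!$OMP END MASTER
!$OMP BARRIER
!$OMP END PARALLEL

    if (myid == 0) print *, 'global sum =', global_sum
    call MPI_Finalize(ierr)
  end program coarse_sketch

Since the parallel region spans essentially the whole computation, the thread-creation overhead is paid only once, at the cost of managing scoping and synchronization by hand.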

Page 16: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Coarse grain results - MG (A class)

Page 17: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Coarse grain results - MG (C class)

Page 18: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Coarse grain results - MG (C class)

• Full node results

  [Table: Max and Min MOPS/CPU for runs on 8 and 64 SMP nodes with 1x8, 2x4, and 4x2 MPI tasks x OpenMP threads per node; for example, 8 nodes at 1x8 gave max 84.2 / min 13.6 MOPS/CPU, and 64 nodes at 4x2 gave max 15.6 / min 3.7 MOPS/CPU]

Page 19: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Variability

• Performance varies by a factor of 2 to 5 (on 64 nodes)
• Seen mostly when the full node is used
• Seen in both fine grain and coarse grain runs
• Seen with both the IBM and the KAI compilers
• Seen in runs on the same set of nodes as well as between different sets
• On a large number of nodes, the average performance suffers a lot
• Confirmed in a micro-study of OpenMP on 1 node

Page 20: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


OpenMP on 1 node microbenchmark results

http://www.sdsc.edu/SciComp/PAA/Benchmarks/Open/open.html

Page 21: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Thread binding

Question: is the variability related to thread migration?

• A study on 1 node (a rough sketch of the measurement loop follows this list):
  – Each OpenMP thread performs an independent matrix inversion taking about 1.6 seconds
  – Monitor the processor id and run time for each thread
  – Repeat 100 times
  – Threads bound OR not bound
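
A rough sketch of such a measurement loop (hypothetical: the triple loop is only a stand-in for the per-thread matrix inversion, and the processor-id query is platform specific, so it is indicated only as a comment):

  program binding_study
    use omp_lib
    implicit none
    integer, parameter :: niter = 100, n = 300
    integer :: iter, tid, i, j, k
    real(8) :: a(n,n), s, t0, t1

    do iter = 1, niter                  ! repeat the experiment 100 times
!$OMP PARALLEL PRIVATE(tid, a, s, i, j, k, t0, t1)
       tid = omp_get_thread_num()
       a   = 1.0d0
       t0  = omp_get_wtime()
       ! stand-in kernel for the independent matrix inversion done by each thread
       do k = 1, n
          do j = 1, n
             s = 0.0d0
             do i = 1, n
                s = s + a(i,j)*a(j,k)
             end do
             a(j,k) = s / dble(n)
          end do
       end do
       t1 = omp_get_wtime()
       ! the processor id would also be recorded here (via a platform-specific
       ! call) to detect whether the thread migrated between iterations
       print *, 'iter', iter, ' thread', tid, ' time', t1 - t0
!$OMP END PARALLEL
    end do
  end program binding_study

Thread binding itself is controlled outside the source, through the run environment, which is why it does not appear in the sketch.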

Page 22: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Thread binding

Results for OMP_NUM_THREADS=8
• Without binding, threads migrated in about 15% of the runs
• With thread binding turned on there was no migration
• 3% of iterations had threads with runtimes > 2.0 sec., a 25% slowdown
• The slowdown occurs with or without binding
• Effect of a single slow thread:
  • Probability that the complete calculation will be slowed:
    P = 1 - (1 - c)^M, with c = 3% and M = 144 nodes of Blue Horizon
    P = 0.9876 probability that the overall result is slowed by 25%
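
As a quick check of that arithmetic: with c = 0.03 and M = 144, P = 1 - 0.97^144 ≈ 1 - 0.012 ≈ 0.988, consistent with the quoted 0.9876.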

Page 23: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Thread binding

• The calculation was rerun with OMP_NUM_THREADS = 7
  • 12.5% reduction in computational power (7 of the node's 8 processors used)
  • No threads showed a slowdown; all ran in about 1.6 seconds

• Summary
  • OMP_NUM_THREADS = 7 yields a 12.5% reduction in computational power
  • OMP_NUM_THREADS = 8 gives a 0.9876 probability that overall results are slowed by 25%, independent of thread binding

Page 24: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon


Overall Conclusions

Based on our study of NPB on Blue Horizon:

• The fine grain hybrid approach is generally worse than pure MPI
• The coarse grain approach for MG is comparable with pure MPI or slightly better
• The coarse grain approach is time- and effort-consuming
• Coarse grain techniques are given above
• There is big variability when using the full node; until this is fixed, we recommend using fewer than 8 threads
• Thread binding does not influence performance