SAN DIEGO SUPERCOMPUTER CENTER
NATIONAL PARTNERSHIP FOR ADVANCED COMPUTATIONAL INFRASTRUCTURE
On pearls and perils of hybrid OpenMP/MPI programming on the
Blue Horizon
D. Pekurovsky, L. Nett-Carrington, D. Holland, T. Kaiser
San Diego Supercomputer Center
Overview
• Blue Horizon hardware
• Motivation for this work
• Two methods of hybrid programming
• Fine grain results
• A word on coarse grain techniques
• Coarse grain results
• Time variability
• Effects of thread binding
• Final conclusions
Blue Horizon Hardware
• 144 IBM SP High Nodes
• Each node:
  – 8-way SMP
  – 4 GB memory
  – crossbar
• Each processor:
  – Power3, 222 MHz
  – 4 Flop/cycle
• Aggregate peak: 1.002 Tflop/s
• Compilers:
  – IBM mpxlf_r, version 7.0.1
  – KAI guidef90, version 3.9
Blue Horizon Hardware
• Interconnect (between nodes):
  – Currently:
    • 115 MB/s
    • 4 MPI tasks/node; must use OpenMP to utilize all processors
  – Soon:
    • 500 MB/s
    • 8 MPI tasks/node; can use OpenMP to supplement MPI (if it's worth it)
Hybrid Programming: why use it?
• Non-performance-related reasons:
  – Avoid replication of data on the node
• Performance-related reasons:
  – Avoid latency of MPI on the node
  – Avoid unnecessary data copies inside the node
  – Reduce latency of MPI calls between the nodes
  – Decrease global MPI operations (reduction, all-to-all)
• The price to pay:
  – OpenMP overheads
  – False sharing
Is it really worth trying?
Hybrid Programming
• Two methods of combining MPI and OpenMP in parallel programs
Fine grain:

    main program
    ! MPI initialization
    ....
    ! cpu intensive loop
    !$OMP PARALLEL DO
    do i = 1, n
       ! work
    end do
    ....
    end

Coarse grain:

    main program
    ! MPI initialization
    !$OMP PARALLEL
    ....
    do i = 1, n
       ! work
    end do
    ....
    !$OMP END PARALLEL
    end
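As an illustration, the fine grain skeleton can be filled in as a small self-contained program. This is only a minimal sketch, assuming a sum-of-squares loop as the CPU-intensive work; the array size, loop body, and reduction are illustrative and not taken from the benchmarks:

    program fine_grain_sketch
      implicit none
      include 'mpif.h'
      integer :: ierr, rank, i, n
      real(8) :: local_sum, global_sum
      real(8), allocatable :: a(:)

      ! MPI initialization: one task per node (or per subset of a node)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      n = 1000000
      allocate(a(n))
      a = 1.0d0
      local_sum = 0.0d0

      ! CPU-intensive loop: a team of threads is created and joined at
      ! every such directive, which is where the fine grain overhead comes from
    !$OMP PARALLEL DO REDUCTION(+:local_sum)
      do i = 1, n
         local_sum = local_sum + a(i)*a(i)
      end do
    !$OMP END PARALLEL DO

      ! Inter-node communication is still plain MPI
      call MPI_Reduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                      MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (rank == 0) print *, 'global sum =', global_sum

      call MPI_Finalize(ierr)
    end program fine_grain_sketch

On Blue Horizon such a code is built with a thread-safe MPI compiler wrapper (e.g. mpxlf_r) with OpenMP compilation enabled.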
Hybrid programming
Fine grain approach
• Easy to implement
• Performance: low, due to the overhead of OpenMP directives (OMP PARALLEL DO)

Coarse grain approach
• Time-consuming implementation
• Performance: less overhead for thread creation
Hybrid NPB using fine grain parallelism
• CG, MG, and FT suites of NAS Parallel Benchmarks (NPB).
Suite name                 # loops parallelized
CG - Conjugate Gradient    18
MG - Multi-Grid            50
FT - Fourier Transform     8

• Results shown are the best of 5-10 runs
• Complete results at:
http://www.sdsc.edu/SciComp/PAA/Benchmarks/Hybrid/hybrid.html
Fine grain results - CG (A&B class)
Fine grain results - MG (A&B class)
Fine grain results - MG (C class)
Fine grain results - FT (A&B class)
Fine grain results - FT (C class)
Hybrid NPB using coarse grain parallelism: MG suite
Overview of the method
[Diagram: MPI tasks 1-4, each spawning OpenMP threads (Thread 1, Thread 2) within its node]
Coarse grain programming methodology
• Start with MPI code
• Each MPI task spawns threads once, at the beginning
• Serial work (initialization, etc.) and MPI calls are done inside a MASTER or SINGLE region
• Main arrays are global
• Work distribution: each thread gets a chunk of the array based on its thread number (omp_get_thread_num()). In this work, one-dimensional blocking.
• Avoid using OMP DO
• Be careful with scoping and synchronization (a minimal sketch follows below)
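A minimal sketch of this structure is given below. The 1-D array, the iteration count, and the exchange_boundaries placeholder are illustrative assumptions, not the actual MG code:

    program coarse_grain_sketch
      use omp_lib
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 1024
      real(8) :: u(n)                      ! main array is global (shared)
      integer :: ierr, rank, myth, nth, ilo, ihi, chunk, i, iter

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      u = 0.0d0

      ! Threads are spawned once; block bounds and loop indices are private
    !$OMP PARALLEL PRIVATE(myth, nth, ilo, ihi, chunk, i, iter)
      myth  = omp_get_thread_num()
      nth   = omp_get_num_threads()
      chunk = n / nth                      ! one-dimensional blocking
      ilo   = myth*chunk + 1
      ihi   = ilo + chunk - 1
      if (myth == nth - 1) ihi = n         ! last thread takes the remainder

      do iter = 1, 10
         do i = ilo, ihi                   ! each thread updates its own block
            u(i) = u(i) + 1.0d0            ! stand-in for the real stencil work
         end do
    !$OMP BARRIER
         ! Serial work and inter-node MPI calls are done by one thread only
    !$OMP MASTER
         call exchange_boundaries(u, n, rank)
    !$OMP END MASTER
    !$OMP BARRIER
      end do
    !$OMP END PARALLEL

      call MPI_Finalize(ierr)

    contains

      subroutine exchange_boundaries(u, n, rank)
        ! Placeholder for the halo exchange with neighbouring MPI tasks
        integer, intent(in)    :: n, rank
        real(8), intent(inout) :: u(n)
      end subroutine exchange_boundaries

    end program coarse_grain_sketch

The key points are that the OMP PARALLEL region encloses the whole iteration loop, work is partitioned by omp_get_thread_num() rather than by OMP DO, and the barriers around the MASTER region provide the synchronization the last bullet warns about.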
Coarse grain results - MG (A class)
Coarse grain results - MG (C class)
Coarse grain results - MG (C class)
• Full node results

# of SMP Nodes   MPI Tasks x OpenMP Threads   Max MOPS/CPU   Min MOPS/CPU
8                4x2                          19.1
8                2x4                          14.9
8                1x8                          84.2           13.6
64               4x2                          15.6           3.7

(Full-node runs were also made on 64 nodes with the 2x4 and 1x8 configurations.)
Variability
• Performance varies by a factor of 2 to 5 (on 64 nodes)
• Seen mostly when the full node is used
• Seen both in fine grain and coarse grain runs
• Seen both with the IBM and the KAI compiler
• Seen in runs on the same set of nodes as well as between different sets
• On a large number of nodes, the average performance suffers a lot
• Confirmed in a micro-study of OpenMP on 1 node
OpenMP on 1 node microbenchmark results
http://www.sdsc.edu/SciComp/PAA/Benchmarks/Open/open.html
Thread binding
Question: is the variability related to thread migration?
• A study on 1 node (sketched below):
  – Each OpenMP thread performs an independent matrix inversion taking about 1.6 seconds
  – Monitor processor id and run time for each thread
  – Repeat 100 times
  – Threads bound OR not bound
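A minimal sketch of such a timing harness, assuming a dummy compute kernel in place of the matrix inversion and omitting the platform-specific processor-id query (do_work, nrep, and tmax are illustrative names):

    program thread_timing_sketch
      use omp_lib
      implicit none
      integer, parameter :: nrep = 100
      integer :: rep, tid
      real(8) :: t0, t1
      real(8), allocatable :: tmax(:)      ! worst runtime observed per thread

      allocate(tmax(0:omp_get_max_threads()-1))
      tmax = 0.0d0

    !$OMP PARALLEL PRIVATE(rep, tid, t0, t1)
      tid = omp_get_thread_num()
      do rep = 1, nrep
         t0 = omp_get_wtime()
         call do_work()                    ! stand-in for the ~1.6 s matrix inversion
         t1 = omp_get_wtime()
         if (t1 - t0 > tmax(tid)) tmax(tid) = t1 - t0
      end do
    !$OMP END PARALLEL

      print *, 'worst per-thread runtime (s):', maxval(tmax)

    contains

      subroutine do_work()
        integer :: i
        real(8) :: s
        s = 0.0d0
        do i = 1, 50000000
           s = s + sqrt(real(i,8))
        end do
        if (s < 0.0d0) print *, s          ! keep the compiler from removing the loop
      end subroutine do_work

    end program thread_timing_sketch

Thread binding itself is controlled through the runtime environment rather than in the source, so it does not appear in the sketch.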
Thread binding
Results for OMP_NUM_THREADS=8
• Without binding, threads migrate in about 15% of the runs
• With thread binding turned on there was no migration
• 3% of iterations had threads with runtimes > 2.0 sec., a 25% slowdown
• The slowdown occurs with and without binding
• Effect of a single slow thread on a full-machine run:
  – Probability that the complete calculation will be slowed: P = 1 - (1 - c)^M, with c = 3% and M = 144 nodes of Blue Horizon
  – P = 0.9876: the overall result is slowed by 25% with near certainty
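As a quick check of that arithmetic, the sketch below simply evaluates the formula with the stated values:

    program slowdown_probability
      implicit none
      real(8) :: c, p
      integer :: m

      c = 0.03d0                  ! fraction of iterations with a slowed thread (3%)
      m = 144                     ! number of Blue Horizon nodes
      p = 1.0d0 - (1.0d0 - c)**m  ! chance that at least one node hits a slow thread

      print *, 'P =', p           ! prints approximately 0.9876
    end program slowdown_probability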
Thread binding
• Calculation was rerun with OMP_NUM_THREADS = 7
  – 12.5% reduction in computational power (one of the 8 processors idle)
  – No thread showed a slowdown; all ran in about 1.6 seconds

• Summary
  – OMP_NUM_THREADS = 7 yields a 12.5% reduction in computational power
  – OMP_NUM_THREADS = 8 gives a 0.9876 probability that the overall result is slowed by 25%, independent of thread binding
Overall Conclusions
Based on our study of NPB on Blue Horizon:
• The fine grain hybrid approach is generally worse than pure MPI
• The coarse grain approach for MG is comparable with pure MPI or slightly better
• The coarse grain approach is time and effort consuming
• Coarse grain techniques have been presented
• There is big run-to-run variability when using the full node. Until this is fixed, we recommend using fewer than 8 threads per node
• Thread binding does not influence performance