EU H2020 Centre of Excellence (CoE) 1 December 2018 – 30 November 2021
Grant Agreement No 824080
Case Study: 3x Speed Improvement for Zenotech's zCFD Solver
Nick Dingle, Numerical Algorithms Group Ltd.
• Performance Optimisation and Productivity
• A Centre of Excellence
  • Collaborative European project funded by the Horizon 2020 programme
  • Runs 1 December 2018 – 30 November 2021
• Providing free services within Europe
  • Precise understanding of parallel application and system behaviour
  • Across application areas, platforms and scales
  • Suggestions/support on how to rewrite code in the most productive way
  • For academic and industrial codes and users
12 March 2019 2
The POP service
• Participating institutions:
  • Barcelona Supercomputing Center, Spain (coordinator)
  • HLRS, Germany
  • IT4Innovations, Czech Republic
  • Jülich Supercomputing Center, Germany
  • NAG, UK
  • RWTH Aachen, IT Center, Germany
  • TERATEC, France
  • Université de Versailles Saint-Quentin-en-Yvelines, France
• A team with:
  • Expertise in performance analysis and optimisation
  • Expertise in parallel programming models and practices
  • A research and development background and a proven commitment to real academic and industrial use cases
The POP team
• A density-based finite volume and Discontinuous Galerkin (DG) computational fluid dynamics (CFD) solver for steady-state or time-dependent flow simulation
• Decomposes domains using unstructured meshes
• Written in Python and C++ and parallelised with OpenMP and MPI
zCFD by Zenotech
• Provides a quantitative measurement of the relative impact of the different factors inherent in parallelisation
• Uses a hierarchy of metrics, with each metric measuring a common cause of inefficiency in parallel programs
• Metrics are efficiencies between 0 and 1; higher numbers are better
• Efficiencies less than 0.8 are candidates for further investigation
POP methodology
• The headline figure is Global Efficiency, which is the product of the Parallel and Computational Efficiencies
• Parallel Efficiency measures the effect that parallelising the code has on the runtime
  • E.g. how balanced the computational work is between threads, how much time is lost to OpenMP overheads, etc.
  • Calculated as the ratio of the average time that threads spend in useful computation (i.e. not in the OpenMP library or I/O) to the total runtime
• Computational Efficiency describes how well the computational load of the application scales with the number of threads
  • The ratio of the time that the serial code spends in useful computation to the total time across all threads that the code spends in useful computation
The metrics
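The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the timings are hypothetical (they are not zCFD measurements), and Computational Efficiency is taken as serial useful time divided by total useful time across threads, so that it cannot exceed 1 when the total work grows with the thread count.

```python
def parallel_efficiency(useful_per_thread, runtime):
    """Average useful-computation time per thread divided by the total runtime."""
    average_useful = sum(useful_per_thread) / len(useful_per_thread)
    return average_useful / runtime


def computational_efficiency(serial_useful, useful_per_thread):
    """Serial useful time divided by the total useful time across all threads."""
    return serial_useful / sum(useful_per_thread)


def global_efficiency(useful_per_thread, runtime, serial_useful):
    """Product of the parallel and computational efficiencies."""
    return (parallel_efficiency(useful_per_thread, runtime)
            * computational_efficiency(serial_useful, useful_per_thread))


# Hypothetical timings in seconds for a 4-thread run (illustrative only).
useful = [8.0, 7.0, 9.0, 8.0]   # useful computation measured on each thread
runtime = 10.0                  # wall-clock time of the parallel run
serial_useful = 30.0            # useful computation in the 1-thread run

pe = parallel_efficiency(useful, runtime)
ce = computational_efficiency(serial_useful, useful)
ge = global_efficiency(useful, runtime, serial_useful)
print(f"parallel={pe:.2f} computational={ce:.2f} global={ge:.2f}")
```

With these made-up numbers the threads are busy with useful work for 80% of the runtime on average, and the total useful work has grown from 30 s to 32 s, which together give a global efficiency of 0.75.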
Efficiencies
# Threads                     1     2     6     12
Global Efficiency            0.97  0.71  0.52  0.33
→ Parallel Efficiency        0.97  0.80  0.64  0.50
→ Computational Efficiency   1.00  0.89  0.82  0.66
• Investigate more deeply:
  • Time spent in parallel code vs. serial code
  • Load balance of OpenMP loops
  • OpenMP overhead
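As a small sketch, the 0.8 threshold from the POP methodology can be applied to the audit figures in the Efficiencies table, together with a check that Global Efficiency matches the product of its two children up to rounding. The values are copied from the table; the helper name `below_threshold` is hypothetical.

```python
THRESHOLD = 0.8  # efficiencies below this are candidates for investigation

threads = [1, 2, 6, 12]
efficiencies = {
    "Global Efficiency": [0.97, 0.71, 0.52, 0.33],
    "Parallel Efficiency": [0.97, 0.80, 0.64, 0.50],
    "Computational Efficiency": [1.00, 0.89, 0.82, 0.66],
}


def below_threshold(values, counts, threshold=THRESHOLD):
    """Return the thread counts at which an efficiency drops below the threshold."""
    return [n for n, value in zip(counts, values) if value < threshold]


flagged = {name: below_threshold(values, threads)
           for name, values in efficiencies.items()}

# Global Efficiency should equal Parallel x Computational, up to the
# two-decimal rounding used in the table.
products = [p * c for p, c in zip(efficiencies["Parallel Efficiency"],
                                  efficiencies["Computational Efficiency"])]
max_gap = max(abs(g - p)
              for g, p in zip(efficiencies["Global Efficiency"], products))

for name, counts in flagged.items():
    print(f"{name}: below {THRESHOLD} at thread counts {counts}")
print(f"largest gap between Global Efficiency and the product: {max_gap:.4f}")
```

Parallel Efficiency is the first metric to fall below the threshold as threads are added, which is why it is examined first.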
Parallel efficiency
# Threads                     1      2      6      12
Parallel Efficiency          0.97   0.80   0.64   0.50
% Parallel Code              -      88.6%  75.0%  66.6%
Load Balance Efficiency      1.00   0.88   0.85   0.85
OpenMP Overhead Efficiency   1.00   1.00   0.99   0.98
VTune (slides 10 to 18; figures not reproduced)
Computational efficiency
# Threads                   1     2     6     12
Computational Efficiency   1.00  0.89  0.82  0.66
IPC Efficiency             1.00  0.94  0.92  0.91
Instructions Efficiency    1.00  1.00  1.00  1.00
CPU Frequency Efficiency   1.00  0.94  0.89  0.72
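The sketch below checks how the three sub-metrics relate to Computational Efficiency and picks out the weakest one at each thread count. It assumes the sub-metrics combine multiplicatively, which is how the POP metric hierarchy is usually described but is not stated on the slide; the values are copied from the table and the function names are hypothetical.

```python
threads = [1, 2, 6, 12]
computational = [1.00, 0.89, 0.82, 0.66]
factors = {
    "IPC Efficiency": [1.00, 0.94, 0.92, 0.91],
    "Instructions Efficiency": [1.00, 1.00, 1.00, 1.00],
    "CPU Frequency Efficiency": [1.00, 0.94, 0.89, 0.72],
}


def product_of_factors(factor_table, index):
    """Multiply the sub-metrics for one thread count (assumed multiplicative)."""
    result = 1.0
    for values in factor_table.values():
        result *= values[index]
    return result


def weakest_factor(factor_table, index):
    """Name of the sub-metric with the lowest value for one thread count."""
    return min(factor_table, key=lambda name: factor_table[name][index])


for i, n in enumerate(threads):
    product = product_of_factors(factors, i)
    print(f"{n:>2} threads: product={product:.2f} "
          f"reported={computational[i]:.2f} "
          f"weakest={weakest_factor(factors, i)}")
```

Because the table values are rounded to two decimals, the product only agrees with the reported figure to about 0.01. On 12 threads the weakest factor is CPU Frequency Efficiency at 0.72.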
Computational performance (slides 20 to 24; figures not reproduced)
CPU frequency (slides 25 to 28; figures not reproduced)
• A surprisingly large amount of time spent executing in serial
• One key OpenMP loop suffered from load imbalance
• CPU frequency was lower when the code was run on the maximum number of threads
Performance Audit findings
• Although the code contained the correct OpenMP pragmas, the compiler found a particular region too complex to analyse and did not apply optimisations or OpenMP pragmas
• Fixed by removing an inline keyword
Parallelising serial portions of code
• The main load imbalance was due to a call to pow() hitting a slow code-path when both base and exponent were close to 1
• This was resolved by scaling the base, raising it to the power, and then undoing the scaling
• Switching OpenMP loop scheduling to dynamic also helped
Improving load balance
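The two fixes above can be illustrated with a minimal Python sketch. It is not the zCFD code: the scale factor of 2.0 is a hypothetical choice (the slides do not say which value was used), Python's `math.pow` stands in for the C/C++ `pow()`, and the scheduling functions are a simplified model of OpenMP `schedule(static)` and `schedule(dynamic)` with a chunk size of 1. Load balance is taken here as average load divided by maximum load, which is an assumption about the metric's definition.

```python
import heapq
import math

SCALE = 2.0  # hypothetical scale factor; the value used in zCFD is not given


def scaled_pow(base, exponent, scale=SCALE):
    """Compute base**exponent as (scale*base)**exponent / scale**exponent."""
    return math.pow(scale * base, exponent) / math.pow(scale, exponent)


def static_loads(costs, n_threads):
    """Per-thread work when iterations are split into equal contiguous blocks."""
    block = -(-len(costs) // n_threads)  # ceiling division
    return [sum(costs[t * block:(t + 1) * block]) for t in range(n_threads)]


def dynamic_loads(costs, n_threads):
    """Per-thread work when each iteration goes to the first thread to be free."""
    heap = [(0.0, t) for t in range(n_threads)]
    heapq.heapify(heap)
    for cost in costs:
        busy, thread = heapq.heappop(heap)
        heapq.heappush(heap, (busy + cost, thread))
    loads = [0.0] * n_threads
    for busy, thread in heap:
        loads[thread] = busy
    return loads


def load_balance(loads):
    """Average load divided by maximum load (1.0 means perfectly balanced)."""
    return (sum(loads) / len(loads)) / max(loads)


# The scaled form agrees with the direct call up to floating-point rounding.
x, y = 1.000001, 0.999999
print(scaled_pow(x, y), math.pow(x, y))

# Hypothetical loop: 90 cheap iterations followed by 10 expensive ones,
# mimicking a few iterations that hit a slow code path.
costs = [1.0] * 90 + [10.0] * 10
print("static :", round(load_balance(static_loads(costs, 4)), 2))
print("dynamic:", round(load_balance(dynamic_loads(costs, 4)), 2))
```

The sketch only demonstrates the algebraic identity behind the scaling trick and the effect of scheduling on balance; it says nothing about the timing of any particular `pow()` implementation.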
• The CPU frequency governor was set to ondemand by default, which meant that the frequency reduced when all 12 threads were active
• Fixed by adding --cpu-freq=performance to the Slurm job submission command
Changing execution environment
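Before changing the job submission, it can help to confirm which governor a compute node is actually using. The sketch below reads the standard Linux cpufreq sysfs files; the function name `read_governors` is hypothetical, and whether these files exist and are readable on a given cluster is an assumption.

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Map each CPU name (e.g. 'cpu0') to its current frequency governor.

    Returns an empty dict if the cpufreq files are missing or unreadable.
    """
    root = Path(cpu_root)
    governors = {}
    if not root.is_dir():
        return governors
    for gov_file in sorted(root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu_name = gov_file.parent.parent.name
        try:
            governors[cpu_name] = gov_file.read_text().strip()
        except OSError:
            continue  # skip CPUs whose governor file cannot be read
    return governors


governors = read_governors()
not_performance = {cpu: gov for cpu, gov in governors.items()
                   if gov != "performance"}
if not governors:
    print("No cpufreq information found on this machine")
elif not_performance:
    print("CPUs not using the performance governor:", not_performance)
else:
    print("All CPUs use the performance governor")
```

Run inside a job on the target node, a report of `ondemand` would match the default described above.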
New efficiencies
Before (Performance Audit):
# Threads                     1     2     6     12
Global Efficiency            0.97  0.71  0.52  0.33
→ Parallel Efficiency        0.97  0.80  0.64  0.50
→ Computational Efficiency   1.00  0.89  0.82  0.66
After:
# Threads                     1     2     6     12
Global Efficiency            1.00  0.89  0.73  0.56
→ Parallel Efficiency        1.00  0.98  0.89  0.76
→ Computational Efficiency   1.00  0.91  0.82  0.74
• For the test case used in the Audit, these improvements meant the code ran 1.65x faster on 12 threads
• On a test case that was 100x larger they gave a 3x performance improvement on 12 threads
Results
• Using metrics helps to break down aspects of performance and identify underlying opportunities for improvement
• Performance analysis tools provide insights into application behaviour
• You do not always need to re-engineer significant portions of your code to achieve meaningful performance increases!
Summary
Contact:
https://www.pop-coe.eu
mailto:[email protected]
@POP_HPC
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No 676553 and 824080.
Performance Optimisation and Productivity
A Centre of Excellence in HPC