A Centre of Excellence in HPC - Case Study: 3x Speed … · 2019. 3. 13. · EU H2020 Centre of Excellence (CoE) 1 December 2018 –30 November 2021 Grant Agreement No 824080 Case

EU H2020 Centre of Excellence (CoE) 1 December 2018 – 30 November 2021

Grant Agreement No 824080

Case Study: 3x Speed Improvement for Zenotech's zCFD Solver

Nick Dingle, Numerical Algorithms Group Ltd.

The POP service

• Performance Optimisation and Productivity

• A Centre of Excellence
  • Collaborative European project funded by Horizon 2020 programme
  • Runs 1 December 2018 – 30 November 2021

• Providing Free Services within Europe
  • Precise understanding of parallel application and system behaviour
  • Across application areas, platforms and scales
  • Suggestions/support on how to rewrite code in the most productive way
  • For academic and industrial codes and users

12 March 2019

The POP team

• Participating institutions:
  • Barcelona Supercomputing Center, Spain (coordinator)
  • HLRS, Germany
  • IT4Innovations, Czech Republic
  • Jülich Supercomputing Center, Germany
  • NAG, UK
  • RWTH Aachen, IT Center, Germany
  • TERATEC, France
  • Université de Versailles Saint-Quentin-en-Yvelines, France

• A team with:
  • Expertise in performance analysis and optimisation
  • Expertise in parallel programming models and practices
  • A research and development background and a proven commitment to real academic and industrial use cases

zCFD by Zenotech

• A density-based finite volume and Discontinuous Galerkin (DG) computational fluid dynamics (CFD) solver for steady-state or time-dependent flow simulation

• Decomposes domains using unstructured meshes

• Written in Python and C++ and parallelised with OpenMP and MPI

POP methodology

• Provides a quantitative measurement of the relative impact of the different factors inherent in parallelisation

• Uses a hierarchy of metrics, with each metric measuring a common cause of inefficiency in parallel programs

• Metrics are efficiencies between 0 and 1; higher numbers are better

• Efficiencies less than 0.8 are candidates for further investigation

The metrics

• The headline figure is Global Efficiency, which is the product of the Parallel and Computational Efficiencies

• Parallel Efficiency measures the effect that parallelising the code has on the runtime
  • E.g. how balanced the computational work is between threads, how much time is lost to OpenMP overheads, etc.
  • Calculated as the ratio of the average time that threads spend in useful computation (i.e. not in the OpenMP library or I/O) to the total runtime

• Computational Efficiency describes how well the computational load of the application scales with the number of threads
  • Calculated as the ratio of the time that the serial code spends in useful computation to the total time, summed across all threads, that the parallel code spends in useful computation
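The definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the timings are hypothetical, not taken from the zCFD audit, and Computational Efficiency is assumed to be serial useful time divided by total useful time across threads, so that the value lies between 0 and 1.

```python
from statistics import mean


def parallel_efficiency(useful_times, runtime):
    """Average useful computation time per thread divided by total runtime."""
    return mean(useful_times) / runtime


def computational_efficiency(serial_useful_time, useful_times):
    """Serial useful time divided by useful time summed across all threads."""
    return serial_useful_time / sum(useful_times)


def global_efficiency(par_eff, comp_eff):
    """Global Efficiency as the product of the two efficiencies."""
    return par_eff * comp_eff


# Hypothetical timings in seconds for a 4-thread run (illustrative only).
useful = [20.0, 22.0, 18.0, 20.0]  # useful computation time per thread
runtime = 25.0                     # wall-clock time of the parallel run
serial_useful = 72.0               # useful computation time of the serial run

par = parallel_efficiency(useful, runtime)
comp = computational_efficiency(serial_useful, useful)
glob = global_efficiency(par, comp)
```

With these made-up numbers the threads are busy on average for 20 of 25 seconds, and the parallel run does 80 seconds of useful work where the serial run needed 72.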

Efficiencies

# Threads                     1     2     6     12
Global Efficiency             0.97  0.71  0.52  0.33
→ Parallel Efficiency         0.97  0.80  0.64  0.50
→ Computational Efficiency    1.00  0.89  0.82  0.66
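As a quick consistency check, the sketch below multiplies the Parallel and Computational Efficiencies from the table and compares the result with the reported Global Efficiency, then lists the metrics that fall below the 0.8 threshold named in the POP methodology. The helper names and the 0.01 rounding tolerance are assumptions made for this illustration.

```python
THREADS = [1, 2, 6, 12]
REPORTED = {
    "Global Efficiency": [0.97, 0.71, 0.52, 0.33],
    "Parallel Efficiency": [0.97, 0.80, 0.64, 0.50],
    "Computational Efficiency": [1.00, 0.89, 0.82, 0.66],
}
THRESHOLD = 0.8  # efficiencies below this are candidates for investigation


def products_match(tolerance=0.01):
    """Check Global = Parallel x Computational, allowing for 2-decimal rounding."""
    rows = zip(
        REPORTED["Global Efficiency"],
        REPORTED["Parallel Efficiency"],
        REPORTED["Computational Efficiency"],
    )
    return all(abs(g - p * c) <= tolerance for g, p, c in rows)


def below_threshold():
    """Map each thread count to the metrics that are below the threshold."""
    flagged = {}
    for i, n in enumerate(THREADS):
        flagged[n] = [name for name, values in REPORTED.items()
                      if values[i] < THRESHOLD]
    return flagged


flagged = below_threshold()
```

On the table's figures, Parallel Efficiency drops below the threshold from 6 threads onwards, while Computational Efficiency only does so at 12 threads.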

• Investigate more deeply:
  • Time spent in parallel code vs. serial code
  • Load balance of OpenMP loops
  • OpenMP overhead

Parallel efficiency

# Threads                     1     2      6      12
Parallel Efficiency           0.97  0.80   0.64   0.50
% Parallel Code               -     88.6%  75.0%  66.6%
Load Balance Efficiency       1.00  0.88   0.85   0.85
OpenMP Overhead Efficiency    1.00  1.00   0.99   0.98
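The slides do not spell out how Load Balance Efficiency is computed. Assuming the definition commonly used in the POP methodology (mean useful time per thread divided by the maximum useful time of any thread), a minimal sketch looks like this. The function name and timings are hypothetical and are not measurements from zCFD.

```python
def load_balance_efficiency(useful_times):
    """Mean useful time per thread divided by the largest useful time.

    A value of 1.0 means every thread did the same amount of useful work;
    lower values mean some threads wait for the slowest one.
    """
    average = sum(useful_times) / len(useful_times)
    return average / max(useful_times)


# Hypothetical per-thread useful times in seconds (illustrative only).
times = [8.5, 10.0, 7.5, 8.0]
lb = load_balance_efficiency(times)
```

In this made-up case one thread takes 10 seconds while the average is 8.5 seconds, giving a value comparable to the 0.85 reported in the table.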

VTune

[Slides 10 to 18: VTune figures not available in this text version]

Computational efficiency

# Threads                     1     2     6     12
Computational Efficiency      1.00  0.89  0.82  0.66
IPC Efficiency                1.00  0.94  0.92  0.91
Instructions Efficiency       1.00  1.00  1.00  1.00
CPU Frequency Efficiency      1.00  0.94  0.89  0.72
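The figures in this table are consistent, to rounding, with Computational Efficiency being the product of the IPC, Instructions and CPU Frequency Efficiencies. The slide does not state this relationship, so it is treated as an assumption in the sketch below, which also picks out the weakest factor at 12 threads. The helper names are hypothetical.

```python
THREADS = [1, 2, 6, 12]
FACTORS = {
    "IPC Efficiency": [1.00, 0.94, 0.92, 0.91],
    "Instructions Efficiency": [1.00, 1.00, 1.00, 1.00],
    "CPU Frequency Efficiency": [1.00, 0.94, 0.89, 0.72],
}
REPORTED = [1.00, 0.89, 0.82, 0.66]  # Computational Efficiency row


def factor_product(index):
    """Multiply the three factor efficiencies for one thread count."""
    product = 1.0
    for values in FACTORS.values():
        product *= values[index]
    return product


def limiting_factor(index):
    """Name of the factor with the lowest efficiency for one thread count."""
    return min(FACTORS, key=lambda name: FACTORS[name][index])


products = [factor_product(i) for i in range(len(THREADS))]
worst_at_12 = limiting_factor(THREADS.index(12))
```

At 12 threads the CPU Frequency Efficiency of 0.72 is the weakest factor by a clear margin.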

Computational performance

[Slides 20 to 24: figures not available in this text version]

CPU frequency

[Slides 25 to 28: figures not available in this text version]

Performance Audit findings

• A surprisingly large amount of time spent executing in serial

• One key OpenMP loop suffered from load imbalance

• CPU frequency was lower when the code was run on the maximum number of threads

Parallelising serial portions of code

• Although the code contained the correct OpenMP pragmas, the compiler found a particular region too complex to analyse, so it neither optimised that region nor applied the OpenMP pragmas to it

• Fixed by removing an inline keyword

Improving load balance

• The main load imbalance was due to a call to pow() hitting a slow code-path when both base and exponent were close to 1

• This was resolved by scaling the base, raising it to the power, and then undoing the scaling

• Switching OpenMP loop scheduling to dynamic also helped
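The scaling workaround relies on the identity b^e = (s·b)^e / s^e. The Python sketch below only demonstrates that the rescaled form gives the same value; it says nothing about speed, since the slow path in question was in the C/C++ maths library call used by zCFD. The function name and the scale factor of 2.0 are hypothetical, as the slides do not say which scaling was used.

```python
import math


def scaled_pow(base, exponent, scale=2.0):
    """Compute base**exponent as (scale*base)**exponent / scale**exponent.

    Scaling moves the base away from 1 before the pow() call, and the
    division afterwards undoes the scaling.
    """
    return math.pow(scale * base, exponent) / math.pow(scale, exponent)


# Base and exponent both close to 1, the case described on the slide.
base, exponent = 1.0000001, 0.9999999
direct = math.pow(base, exponent)
rescaled = scaled_pow(base, exponent)
```

The two results agree to within floating-point rounding, at the cost of one extra pow() call and a division per evaluation.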

Changing execution environment

• The CPU frequency governor was set to ondemand by default, which meant that the frequency reduced when all 12 threads were active

• Fixed by adding --cpu-freq=performance to the Slurm job submission command
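One way to confirm which governor is active on a compute node is to read the standard Linux cpufreq sysfs entries. The sketch below is an illustration under that assumption; the helper names are hypothetical, and on systems without cpufreq support the result is simply an empty dictionary.

```python
from pathlib import Path


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU exposing a cpufreq governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(sysfs_root).glob(pattern)):
        cpu = path.parent.parent.name  # e.g. "cpu0"
        try:
            governors[cpu] = path.read_text().strip()
        except OSError:
            continue  # skip entries that cannot be read
    return governors


def all_performance(governors):
    """True only if at least one CPU was found and all use 'performance'."""
    return bool(governors) and all(
        g == "performance" for g in governors.values())


if __name__ == "__main__":
    found = read_governors()
    print(found)
    print("all performance:", all_performance(found))
```

Running this inside a job submitted with and without --cpu-freq=performance would show whether the setting took effect on the allocated node.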

New efficiencies

[Two tables of updated efficiencies for 1, 2, 6 and 12 threads; values missing from this text version]

Results

• For the test case used in the Audit, these improvements meant the code ran 1.65x faster on 12 threads

• On a test case that was 100x larger they gave a 3x performance improvement on 12 threads

Summary

• Using metrics helps to break down aspects of performance and identify underlying opportunities for improvement

• Performance analysis tools provide insights into application behaviour

• You do not always need to re-engineer significant portions of your code to achieve meaningful performance increases!

Contact:
https://www.pop-coe.eu
mailto:[email protected]

@POP_HPC

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No 676553 and 824080.

Performance Optimisation and Productivity
A Centre of Excellence in HPC
