
Page 1: Runtime Performance Optimizations for an OpenFOAM Simulation

science + computing ag
IT Service and Software Solutions for Complex Computing Environments
Tuebingen | Muenchen | Berlin | Duesseldorf

Runtime Performance Optimizations for an OpenFOAM Simulation

Dr. Oliver Schröder HPC Services and Software

21 June 2014

HPC Advisory Council European Conference 2014

Page 2: Runtime Performance Optimizations for an OpenFOAM Simulation

Many Thanks to…

Applications and Performance Team
• Madeleine Richards
• Rafael Eduardo Escovar Mendez

HPC Services and Software Team
• Fisnik Kraja
• Josef Hellauer

Page 3: Runtime Performance Optimizations for an OpenFOAM Simulation

Overview

Introduction
The Test Case
The Benchmarking Environment
Initial Results (the starting point)

Dependency Analysis
• CPU Frequency
• Interconnect
• Memory Hierarchy

Performance Improvements
• MPI Communications
  – Tuning MPI Routines
  – Intel MPI vs. Bull MPI
• Domain Decomposition

Conclusions


Page 4: Runtime Performance Optimizations for an OpenFOAM Simulation

Introduction


How can we improve application performance?
• By using the resources efficiently
• By selecting the appropriate resources

In order to do this we have to:
• Analyze the behavior of the application
• Then tune the runtime environment accordingly

In this study we:
• Analyze the dependencies of an OpenFOAM simulation
• Improve the performance of OpenFOAM by
  – tuning MPI communications
  – selecting the appropriate domain decomposition


Page 5: Runtime Performance Optimizations for an OpenFOAM Simulation

The Test Environment


Compute Cluster        Sid                            Robin

Hardware
Processor              Intel E5-2680 v2 Ivy Bridge    Intel E5-2697 v2 Ivy Bridge
Frequency              2.80 GHz                       2.70 GHz
Cores per processor    10                             12
Sockets per node       2                              2
Cores per node         20                             24
Cache                  25 MB L3, 10 x 256 KB L2       30 MB L3, 12 x 256 KB L2
Memory                 64 GB                          64 GB
Memory frequency       1866 MHz                       1866 MHz
I/O file system        NFS over IB                    NFS over IB
Interconnect           IB FDR                         IB FDR

Software
OpenFOAM               2.2.2                          2.2.2
Intel C Compiler       13.1.3.192                     13.1.3.192
GNU C Compiler         4.7.3                          4.7.3
Compiler options       -O3 -fPIC -m64                 -O3 -fPIC -m64
Intel MPI              4.1.0.030                      4.1.0.030
Bull MPI               1.2.4.1                        1.2.4.1
SLURM                  2.6.0                          2.5.0
OFED                   1.5.4.1                        1.5.4.1
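As a practical note, a minimal sketch of how two such builds could be produced: OpenFOAM selects its compiler through the WM_COMPILER setting consumed by etc/bashrc, while flags such as -O3 -fPIC -m64 come from the wmake rules. The invocation below is illustrative, not the build script used for these benchmarks.

```bash
# Sketch: build OpenFOAM 2.2.2 once with GCC and once with the Intel
# compiler by overriding WM_COMPILER when sourcing the environment.
cd OpenFOAM-2.2.2
source etc/bashrc WM_COMPILER=Gcc   # or: WM_COMPILER=Icc
./Allwmake                          # builds everything with the selected compiler
```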

Page 6: Runtime Performance Optimizations for an OpenFOAM Simulation

Initial Results on SID (E5-2680v2) (1)

GCC (IMPI) vs. ICC (IMPI)
• Better performance is obtained with the OpenFOAM version compiled with GCC
• During the development and optimization of OpenFOAM, mainly GCC is used

8 vs. 10 Tasks/CPU
• Better performance with 8 tasks/CPU. The reasons might be:
  – With fewer tasks we use fewer cores in each CPU, leaving more cache and more memory bandwidth per core
  – The domain decomposition is different; the impact of domain decomposition on performance is shown later


GCC vs. ICC (Clock Time [s]):

              256 Tasks    512 Tasks    768 Tasks
              16 Nodes     32 Nodes     48 Nodes
GCC           1767         1019         869
ICC           1806         1080         884

8 vs. 10 Tasks/CPU (Clock Time [s]):

              16 Nodes     32 Nodes     48 Nodes
8 Tasks/CPU   1767         1019         869
10 Tasks/CPU  1799         1116         1010
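A sketch of how such an under-populated run might be submitted, assuming the SLURM and Intel MPI versions from the test environment; the case path and solver name are placeholders:

```bash
#!/bin/bash
# Hypothetical SLURM job: 256 tasks on 16 nodes, i.e. 16 tasks per
# 20-core node (8 tasks per CPU socket) instead of full population.
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=16
cd /path/to/openfoam/case             # placeholder case directory
mpirun -np 256 simpleFoam -parallel   # solver name is illustrative
```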

Page 7: Runtime Performance Optimizations for an OpenFOAM Simulation

Initial Results on SID (E5-2680v2) (2)


OpenFOAM: CPU Binding (Clock Time [s]):

                  320 Tasks    640 Tasks    960 Tasks
                  16 Nodes     32 Nodes     48 Nodes
No CPU binding    1801         1121         1007
With CPU binding  1799         1116         1010

OpenFOAM tests on 16 nodes (Clock Time [s]):

                DAPL     OFA      zone      huge     hyper-      turbo
                fabric   fabric   reclaim   pages    threading   boost
Tasks           320      320      320       320      640         320
Clock Time [s]  1804     1845     1802      1807     2229        1750

[Chart: Example results with Star-CCM+ on 6 to 48 nodes (120 to 960 tasks), time in seconds for cumulative optimization settings: none; cb; cb,zr; cb,zr,thp; cb,zr,thp,trb; cb,zr,thp,ht (cb=CPU binding, zr=zone reclaim, thp=transparent huge pages, trb=turbo, ht=hyperthreading).]

Page 8: Runtime Performance Optimizations for an OpenFOAM Simulation

Initial Results on SID (E5-2680v2) (3)

CPU Binding
• With other applications we often observe performance improvements when we bind tasks to specific CPUs (especially when hyperthreading is on)
• This does not seem to be the case for this OpenFOAM simulation

We also applied the other optimization techniques shown above, but only Turbo Boost brought an improvement, and even there we expected more than the 3% we measured.

To understand this behavior, we really need to analyze the dependencies of OpenFOAM.
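For reference, binding was presumably toggled through Intel MPI's standard pinning controls; a sketch, with the solver name as a placeholder:

```bash
# With CPU binding: pin each MPI task to a fixed core.
export I_MPI_PIN=1
export I_MPI_PIN_PROCESSOR_LIST=allcores
mpirun -np 320 simpleFoam -parallel   # solver name is illustrative

# Without CPU binding: leave task placement to the OS scheduler.
export I_MPI_PIN=0
mpirun -np 320 simpleFoam -parallel
```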


Page 9: Runtime Performance Optimizations for an OpenFOAM Simulation

Overview

Introduction
The Test Case
The Benchmarking Environment
Initial Results (the starting point)

Dependency Analysis
• CPU Frequency
• Interconnect
• Memory Hierarchy

Performance Improvements
• MPI Communications
  – Tuning MPI Routines
  – Intel MPI vs. Bull MPI
• Domain Decomposition

Conclusions


Page 10: Runtime Performance Optimizations for an OpenFOAM Simulation

CPU Comparison: E5-2680v2 vs. E5-2697v2


Core-to-Core Comparison (both CPUs at 2.4 GHz)
• 16 nodes: better on the E5-2697v2 CPUs, consistent with their additional 5 MB of L3 cache
• 24 nodes: very small differences – MPI overhead hides the benefit of the larger cache

Node-to-Node Comparison (fully populated nodes)
• 16 nodes: 2% better on the E5-2697v2 CPUs
• 24 nodes: <1% better on the E5-2697v2 CPUs


Core-to-Core Comparison, OpenFOAM @ 2.4 GHz (Clock Time [s]):

                     256 Tasks   320 Tasks   384 Tasks   480 Tasks
                     16 Nodes    16 Nodes    24 Nodes    24 Nodes
E5-2680v2 on Sid     1912        1925        1364        1345
E5-2697v2 on Robin   1849        1803        1355        1344

Node-to-Node Comparison, OpenFOAM on fully populated nodes (Clock Time [s]):

                               16 Nodes    24 Nodes
E5-2680v2 @ 2.8 GHz (Sid)      1799        1294
E5-2697v2 @ 2.7 GHz (Robin)    1765        1284

Page 11: Runtime Performance Optimizations for an OpenFOAM Simulation

OpenFOAM Dependency on CPU Frequency: Tests on SID B71010c (E5-2680v2)

The dependency on the CPU frequency is only modest: the fitted exponents correspond to roughly 45% of ideal frequency scaling.

Beyond 2.5 GHz there is not much further improvement.


Power-law fits (t: wall clock time in seconds, f: CPU frequency in MHz, plot range 1000 to 3000 MHz):

16 Nodes - 320 Tasks:  t = 69878 f^-0.463   (R² = 0.9618)
32 Nodes - 640 Tasks:  t = 36256 f^-0.441   (R² = 0.9697)
48 Nodes - 960 Tasks:  t = 25120 f^-0.408   (R² = 0.9702)
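The fitted exponents quantify the claim: with perfect frequency scaling, run time would fall as f^-1, while here it falls roughly as f^-0.45. A worked example based on the 16-node fit (the arithmetic is added here, not taken from the slides):

```latex
t(f) = a\,f^{-b}, \qquad b \approx 0.41\text{--}0.46
\quad (\text{ideal frequency scaling would give } b = 1)

\frac{t(2500\,\mathrm{MHz})}{t(2800\,\mathrm{MHz})}
  = \left(\frac{2800}{2500}\right)^{0.463} \approx 1.05
```

So raising the clock by 12% beyond 2.5 GHz buys only about 5% in run time, which matches the flattening of the curves.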

Page 12: Runtime Performance Optimizations for an OpenFOAM Simulation

OpenFOAM Dependency on Interconnect: Compact Task Placement vs. Fully Populated


With the compact task placement we bind all 10 tasks to one of the two CPUs in each node, so that cache size and memory bandwidth per task stay the same while each task gets more interconnect bandwidth.
• Tests with 320 tasks: 3% improvement
• Tests with 640 tasks: 11% improvement

Conclusion: the dependency on interconnect bandwidth increases with the number of tasks (nodes).


Compact vs. fully populated test (speedup):

           320 Tasks    640 Tasks
Speedup    1.03         1.11

Fully populated (20 Tasks/Node):    Compact (10 Tasks/Node, 2x #Nodes):
320 Tasks on 16 Nodes               320 Tasks on 32 Nodes
640 Tasks on 32 Nodes               640 Tasks on 64 Nodes
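A sketch of how the two placements could be requested under SLURM; the core range, job script, and pinning values are illustrative:

```bash
# Fully populated: 320 tasks on 16 nodes, 20 tasks/node.
sbatch --nodes=16 --ntasks-per-node=20 job.sh

# Compact: the same 320 tasks on 32 nodes, 10 tasks/node, all pinned to
# one socket so cache and memory bandwidth per task stay unchanged.
export I_MPI_PIN_PROCESSOR_LIST=0-9   # cores of the first socket
sbatch --nodes=32 --ntasks-per-node=10 job.sh
```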

Page 13: Runtime Performance Optimizations for an OpenFOAM Simulation

OpenFOAM Dependency on Memory Hierarchy: Scatter vs. Compact Task Placement

The dependency on the memory hierarchy is larger when running OpenFOAM on a small number of nodes; on more nodes, the bottleneck is no longer the memory hierarchy but the interconnect.

The impact of the L3 cache stays almost the same when doubling the number of nodes, whereas the impact of the memory throughput drops.


Scatter vs. compact task placement:

                                     320 Tasks    640 Tasks
                                     32 Nodes     64 Nodes
Improvement from memory bandwidth    28%          4%
Improvement from the L3 cache        22%          24%
Overall speedup                      1.50         1.28
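The scatter and compact placements compared here map onto Intel MPI's pinning maps; a sketch, assuming the documented I_MPI_PIN_PROCESSOR_LIST syntax and a placeholder solver name:

```bash
# Compact: fill one socket before the other; neighbouring tasks share
# L3 cache and one memory controller.
export I_MPI_PIN_PROCESSOR_LIST=allcores:map=bunch
mpirun -np 320 simpleFoam -parallel   # solver name is illustrative

# Scatter: distribute tasks round-robin across both sockets, giving
# each task more L3 cache and memory bandwidth.
export I_MPI_PIN_PROCESSOR_LIST=allcores:map=scatter
mpirun -np 320 simpleFoam -parallel
```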

Page 14: Runtime Performance Optimizations for an OpenFOAM Simulation

OpenFOAM Dependency on the File System: Lustre Striping

We tested OpenFOAM on Lustre with various stripe counts and stripe sizes.

We could not obtain any improvement.


[Chart: Clock Time [s] over 16/32/48 nodes (20 Tasks/Node) for seven Lustre striping configurations: 1, 2, 4, and 6 stripes of 1 MB, and single stripes of 2, 4, and 6 MB; the curves are virtually identical.]
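The striping variants correspond to standard lfs setstripe layouts on the case directory; a sketch with placeholder paths (newly created files inherit the directory's layout):

```bash
# 4 stripes of 1 MB across 4 OSTs for everything created below this dir.
lfs setstripe -c 4 -S 1M /lustre/bench/openfoam_case

# Single stripe of 4 MB for comparison.
lfs setstripe -c 1 -S 4M /lustre/bench/openfoam_case_s4

# Verify the resulting layout.
lfs getstripe /lustre/bench/openfoam_case
```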

Page 15: Runtime Performance Optimizations for an OpenFOAM Simulation

Overview

Introduction
The Test Case
The Benchmarking Environment
Initial Results (the starting point)

Dependency Analysis
• CPU Frequency
• Interconnect
• Memory Hierarchy

Performance Improvements
• MPI Communications
  – Tuning MPI Routines
  – Intel MPI vs. Bull MPI
• Domain Decomposition

Conclusions


Page 16: Runtime Performance Optimizations for an OpenFOAM Simulation

MPI Profiling (with Intel MPI): Tests on SID B71010c (E5-2680v2)

The portion of time spent in MPI increases with the number of nodes.

Looking at the MPI calls:
• ~50% of the MPI time is spent in MPI_Allreduce
• ~30% of the MPI time is spent in MPI_Waitall


Time share per task:

           320 Tasks    640 Tasks    960 Tasks
           16 Nodes     32 Nodes     48 Nodes
User       76%          58%          42%
MPI        23.5%        41.6%        57.9%

Share of MPI time per routine:

                320 Tasks    640 Tasks    960 Tasks
                16 Nodes     32 Nodes     48 Nodes
MPI_Allreduce   48.6%        53.5%        51.2%
MPI_Waitall     31.9%        29.6%        28.4%
MPI_Recv        15.1%        11.7%        14.0%
Other           4.4%         5.2%         6.4%
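Breakdowns like the one above can be collected with Intel MPI's built-in statistics, without an external profiler; a sketch using documented controls (the exact variables used for these measurements are not stated in the slides):

```bash
# Gather per-routine MPI statistics during the run; results are written
# to stats.txt in the working directory by default.
export I_MPI_STATS=3            # levels 1..10 give increasing detail
export I_MPI_STATS_SCOPE=all    # cover collectives and point-to-point
mpirun -np 640 simpleFoam -parallel   # solver name is illustrative
```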

Page 17: Runtime Performance Optimizations for an OpenFOAM Simulation

MPI Allreduce Tuning (with Intel MPI): Tests on SID B71010c (E5-2680v2)

The best Allreduce algorithm: (5) binomial gather + scatter.

Overall improvement obtained:
• none on 16 Nodes – 320 Tasks
• 11% on 32 Nodes – 640 Tasks


1. Recursive doubling algorithm
2. Rabenseifner's algorithm
3. Reduce + Bcast algorithm
4. Topology aware Reduce + Bcast algorithm
5. Binomial gather + scatter algorithm
6. Topology aware binomial gather + scatter algorithm
7. Shumilin's ring algorithm
8. Ring algorithm

MPI Allreduce algorithms, Clock Time [s]:

Algorithm              default   1      2      3      4      5      6      7      8
16 Nodes - 320 Tasks   1797      1893   2075   2232   1864   1801   1860   2208   2813
32 Nodes - 640 Tasks   1117      1210   1597   4096   1490   1004   1526   3985   7373
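The numbering above matches Intel MPI's I_MPI_ADJUST_ALLREDUCE control, so the winning algorithm can simply be fixed for production runs; the solver name below is a placeholder:

```bash
# Force algorithm 5 (binomial gather + scatter) for all MPI_Allreduce calls.
export I_MPI_ADJUST_ALLREDUCE=5
mpirun -np 640 simpleFoam -parallel   # solver name is illustrative

# The documented syntax also allows restricting an algorithm to a
# message-size range, e.g. algorithm 5 only for messages up to 8 KB:
export I_MPI_ADJUST_ALLREDUCE="5:0-8192"
```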

Page 18: Runtime Performance Optimizations for an OpenFOAM Simulation

Finding the Right Domain Decomposition: Tests on Robin with E5-2697v2 CPUs

We obtain:
• 10-14% improvement on 16 nodes
• 3-5% improvement on 24 nodes

The decomposition improves data locality, and thus intra-node communication, but its impact shrinks once the inter-node communication overhead grows.


[Chart: Domain decomposition (bottom up: Y Z X) for OpenFOAM on 16 nodes with 24 Tasks/Node; 16 decompositions of 384 subdomains tested, from X=2 Z=2 Y=96 up to X=8 Z=8 Y=6. Labeled bars: the baseline takes 1765 s (speedup 1.000), the best variants 1546 s, 1550 s, and 1592 s (speedups 1.142, 1.139, 1.109).]

[Chart: Domain decomposition (bottom up: Y Z X) for OpenFOAM on 24 nodes with 24 Tasks/Node; 19 decompositions of 576 subdomains tested, from X=2 Z=2 Y=144 up to X=8 Z=8 Y=9. Labeled bars: the baseline takes 1284 s (speedup 1.000), the best variants 1246 s, 1224 s, and 1249 s (speedups 1.030, 1.049, 1.028).]
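These decompositions map directly onto OpenFOAM's decomposeParDict; a minimal sketch for the 16-node baseline, assuming the simple method (the dictionary of the actual test case is not shown in the slides, and the FoamFile header is omitted):

```bash
# Hypothetical system/decomposeParDict: 384 subdomains (16 nodes x 24
# tasks) in the X=2, Z=2, Y=96 layout from the chart above; note that
# OpenFOAM orders the vector as (nx ny nz).
cat > system/decomposeParDict <<'EOF'
numberOfSubdomains 384;
method          simple;
simpleCoeffs
{
    n           (2 96 2);   // nx ny nz -> 2 x 96 x 2 = 384
    delta       0.001;
}
EOF
decomposePar   # standard OpenFOAM utility that splits the mesh
```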

Page 19: Runtime Performance Optimizations for an OpenFOAM Simulation

Bull MPI vs. Intel MPI: Tests on Robin with E5-2697v2 CPUs

Bull MPI is based on Open MPI. It has been enhanced with many features, such as:
• effective abnormal pattern detection
• network-aware collective operations
• multi-path network failover

It has been designed to boost the performance of MPI applications through full integration with:
• bullx PFS (based on Lustre)
• bullx BM (based on SLURM)


[Chart: Intel MPI vs. Bull MPI, Clock Time [s] on fixed decompositions. Relative differences between the two libraries: 6.2% (X=2 Z=2 Y=96) and 3.2% (X=2 Z=6 Y=32) with 384 Tasks on 16 Nodes; 1.4% (X=2 Z=2 Y=144) and 3.9% (X=2 Z=9 Y=32) with 576 Tasks on 24 Nodes.]

Page 20: Runtime Performance Optimizations for an OpenFOAM Simulation

Conclusions


1. CPU Frequency
• OpenFOAM shows a dependency of under 50% on the CPU frequency

2. Cache Size
• OpenFOAM is dependent on the cache size per core

3. The limiting factor changes as the number of nodes changes:
• Memory bandwidth: OpenFOAM is memory-bandwidth dependent on a small number of nodes
• Communication: MPI time increases substantially with an increasing number of nodes

4. Tuning MPI Allreduce Operations
• Up to 11% improvement can be obtained

5. Domain Decomposition
• Finding the right decomposition of the mesh is very important

6. File System
• We could not obtain better results with different stripe counts or sizes on Lustre


Page 21: Runtime Performance Optimizations for an OpenFOAM Simulation

Thank you for your attention.

science + computing ag www.science-computing.de

Oliver Schröder

Email: [email protected]