
www.cineca.it

OpenFOAM on BG/Q porting and performance

Paride Dagna, SCAI Department, CINECA


SYSTEM OVERVIEW

OpenFOAM : selected application inside of PRACE project

Fermi : PRACE Tier-0 System

Model: IBM BlueGene/Q

Architecture: 10 BG/Q frames with 2 midplanes each

Front-end Nodes OS: Red Hat EL 6.2

Compute Node Kernel: lightweight Linux-like kernel

Processor Type: IBM PowerA2, 16 cores, 1.6 GHz

Computing Nodes: 10,240

Computing Cores: 163,840

RAM: 16 GB / node

Internal Network: network interface with 11 links -> 5D torus

Disk Space: more than 2 PB of scratch space

Peak Performance: 2.1 PFlop/s
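The Fermi figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check. It assumes 512 compute nodes per midplane and 8 double-precision floating-point operations per core per cycle (a 4-wide fused multiply-add unit). Neither value appears in the deck; both are the usual Blue Gene/Q characteristics.

```python
# Cross-check of the Fermi node count, core count and peak performance.
# NODES_PER_MIDPLANE and FLOPS_PER_CYCLE are not stated in the deck; they are
# assumed here as the usual Blue Gene/Q values.
FRAMES = 10
MIDPLANES_PER_FRAME = 2
NODES_PER_MIDPLANE = 512      # assumption
CORES_PER_NODE = 16
CLOCK_HZ = 1.6e9
FLOPS_PER_CYCLE = 8           # assumption: 4-wide double-precision FMA

nodes = FRAMES * MIDPLANES_PER_FRAME * NODES_PER_MIDPLANE
cores = nodes * CORES_PER_NODE
peak_pflops = cores * CLOCK_HZ * FLOPS_PER_CYCLE / 1e15

print(nodes)                  # 10240
print(cores)                  # 163840
print(round(peak_pflops, 1))  # 2.1
```

Under these assumptions the node count, the core count and the peak performance all agree with the values listed for Fermi.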


SYSTEM OVERVIEW

Single chip module

Compute card: one chip module, 16 GB DDR3 memory

Compute node (back-end):

• Each compute node comprises 17 cores on a single chip with 16 GB of dedicated physical memory.

• Applications run on 16 of the cores, with the 17th core reserved for system software.

• Nearly the full 16 GB of physical memory is dedicated to application usage.

• Each core can run up to 4 processes/threads, for a total of 64 processes/threads per node.

Applications:

• Applications are submitted to the compute nodes by the batch scheduler system.

• To run on the compute nodes (back-end), applications must be cross-compiled.
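The trade-off between processes per core and memory per process follows directly from the compute-node figures above. The sketch below is a rough upper bound: it assumes the 16 GB of a node are shared evenly among the processes and ignores the small share kept by system software. The function name is illustrative only.

```python
# Memory available to each process when a node's 16 GB is shared evenly.
# Even sharing is an assumption of this sketch; the deck only gives totals.
NODE_MEMORY_MB = 16 * 1024
CORES_PER_NODE = 16

def memory_per_process_mb(processes_per_core):
    """Upper bound on memory per process, in MB, for 1 to 4 processes per core."""
    if processes_per_core not in (1, 2, 3, 4):
        raise ValueError("a core runs between 1 and 4 processes/threads")
    return NODE_MEMORY_MB / (CORES_PER_NODE * processes_per_core)

for ppc in (1, 2, 4):
    # processes per core, processes per node, MB per process
    print(ppc, CORES_PER_NODE * ppc, memory_per_process_mb(ppc))
```

With one process per core each process has at most 1 GB; at the full 64 processes per node the bound drops to 256 MB, which constrains how large a mesh partition each process can hold.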


Porting of OpenFOAM on BG/Q

Compiling OpenFOAM for the back-end nodes on BG/Q requires some system-specific changes to the configuration scripts of OpenFOAM and of the third-party package.

The third-party MPI cannot be used; rules for the BG/Q MPI must be inserted.

Environment configuration:

• Configure the environment with compilers and zlib using modules:

module load bgq-gnu

module load zlib

OpenFOAM configuration scripts and rules:

• The files “bashrc” and “settings.sh” must be changed by inserting the rules for the BG/Q MPI.

• The “c” and “c++” files in the wmake/rules folders must be modified for dynamic linking.

Scotch library build:

• Before running “Allwmake” in the OpenFOAM main folder, the compiling and dynamic-linking rules in the file “Makefile.inc” of the scotch library need to be changed.

• Cross-compile the “dummysizes” scotch utility and execute it on the back-end to build the header files scotch.h and scotchf.h properly.

Compile:

• Go to $WM_PROJECT/$WM_PROJECT_VERSION and compile with ./Allwmake


Performance of OpenFOAM on BG/Q

Test cases

• Cavity 3D: isothermal incompressible flow. Solver : icoFoam

• BoxTurb 3D: homogeneous isotropic turbulence in compressible flow. Solver : sonicFoam

• Airfoil – wing section: external aerodynamics. Solver : simpleFoam

• DTMB hull: marine hydrodynamics. Solver : interFoam


Performance of OpenFOAM on BG/Q

Systems

Model: IBM BlueGene/Q (Fermi)

Processor Type: IBM PowerA2, 1.6 GHz

Computing Node: 16 cores

RAM: 16 GB / node; 1 GB / core

Internal Network: network interface with 11 links -> 5D torus

Model: Hewlett Packard C7000 (Lagrange)

Processor Type: Intel Xeon Westmere, 2.8 GHz

Computing Node: 12 cores

RAM: 24 GB / node; 2 GB / core

Internal Network: Infiniband QDR/DDR Voltaire, fat tree


Cavity – 3D

Flow : laminar, isothermal, incompressible

Mesh : fully structured 3D

Mesh elements : cubes

Elements : 10,000,000; partition methods : scotch, simple; solver : icoFoam

Elements : 20,000,000; partition methods : scotch, simple; solver : icoFoam
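To get a feeling for how much work each core receives in the two cavity cases, the average number of cells per core can be tabulated. This is a sketch under the assumption of a perfectly balanced partition. The core counts are those on the axes of the scaling plots; the function name is illustrative.

```python
# Average number of mesh cells handled by each core for the two cavity meshes.
MESH_SIZES = (10_000_000, 20_000_000)
CORE_COUNTS = (64, 128, 256, 512, 1024, 2048, 4096)

def cells_per_core(n_cells, n_cores):
    """Average cells per core, assuming a perfectly balanced partition."""
    return n_cells / n_cores

for n_cells in MESH_SIZES:
    for n_cores in CORE_COUNTS:
        print(n_cells, n_cores, round(cells_per_core(n_cells, n_cores)))
```

For the 10,000,000-element mesh the load falls from 156,250 cells per core at 64 cores to roughly 2,400 at 4096 cores. The computation per core therefore shrinks much faster than the fixed costs of communication and I/O.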


Cavity – 3D Speed up and Efficiency

Mesh : 10,000,000 elements

Solution saved at final time step

[Figures: efficiency and speed-up versus number of cores (64 to 4096), partition methods simple and scotch; curves: Fermi, Lagrange, Ideal]
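The deck does not spell out how speed-up and efficiency are computed. A common convention that matches the axes of the plots (an ideal speed-up equal to the number of cores) is sketched below. It assumes the smallest run is the reference and is treated as ideal. The timings in the example are hypothetical placeholders, not measurements from the deck.

```python
# Speed-up and efficiency relative to the smallest run.
# Assumption of this sketch: the reference run (smallest core count) is
# assigned a speed-up equal to its core count, so that the ideal curve is
# speed-up == number of cores.

def speedup(ref_cores, ref_time, time):
    """Speed-up of a run taking `time`, scaled so the reference run is ideal."""
    return ref_cores * ref_time / time

def efficiency(ref_cores, ref_time, cores, time):
    """Parallel efficiency: achieved speed-up divided by the ideal speed-up."""
    return speedup(ref_cores, ref_time, time) / cores

# Hypothetical wall-clock times in seconds, NOT measurements from the deck.
timings = {64: 1000.0, 128: 520.0, 256: 280.0}
for cores, time in timings.items():
    s = speedup(64, timings[64], time)
    e = efficiency(64, timings[64], cores, time)
    print(cores, round(s, 1), round(e, 2))
```

With this convention an efficiency of 1.00 means that doubling the cores halves the run time, and values above 1.00 (the plots run up to 1.20) indicate super-linear scaling.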


Cavity – 3D Speed up and Efficiency

Mesh : 10,000,000 elements

Solution saved every 10 time steps

[Figures: efficiency and speed-up versus number of cores (64 to 1024), partition methods simple and scotch; curves: Fermi, Lagrange, Ideal]


Cavity – 3D – Profiling

[Figure: I/O overhead on simulation time, increment % (0% to 250%) versus number of cores (64 to 1024); curves: Fermi, Lagrange]

# Cores   Cumulative I/O (GB)   File size per core (MB)
64        13.0                  5.10
128       14.0                  2.50
256       14.0                  1.33
512       15.0                  0.75
1024      22.0                  0.40

Number of iterations : 100

Files per core : 3

MPI_Allreduce average message size per core (B) : 8 (at 1024 cores)

Average message size sent and received per core (KB) : 4.6 (at 1024 cores)

[Figures: MPI and I/O profiling at 512 cores and at 1024 cores]
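The profiling figures above suggest an output pattern in which every core writes its own small files. A rough file count can be sketched as follows. It assumes that "files per core" is counted per saved time step and that the run saves every 10 iterations, as in the "saved every 10 time steps" runs. Both are assumptions of this sketch, and the function name is illustrative.

```python
# Rough count of output files when every core writes its own files.
# Assumptions of this sketch: "files per core" is counted per saved time
# step, and the run saves every `write_interval` iterations.

def total_files(n_cores, files_per_core, n_iterations, write_interval):
    """Number of files written over the whole run."""
    n_saves = n_iterations // write_interval
    return n_cores * files_per_core * n_saves

for n_cores in (64, 128, 256, 512, 1024):
    print(n_cores, total_files(n_cores, files_per_core=3,
                               n_iterations=100, write_interval=10))
```

Under these assumptions the file count grows linearly with the number of cores, while the table shows each file shrinking to well below 1 MB. Many small files of this kind are a typical source of file system overhead.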


Cavity – 3D Speed up and efficiency

Mesh : 20,000,000 elements

Solution saved at final time step

[Figures: efficiency and speed-up versus number of cores (64 to 4096), partition methods simple and scotch; curves: Fermi, Lagrange, Ideal]


Cavity – 3D Speed up and efficiency

Mesh : 20,000,000 elements

Solution saved every 10 time steps

[Figures: efficiency and speed-up versus number of cores (64 to 1024), partition methods simple and scotch; curves: Fermi, Lagrange, Ideal]


Cavity – 3D – Profiling

# Cores   Cumulative I/O (GB)   File size per core (MB)
64        18.1                  9.46
128       18.1                  4.73
256       18.5                  2.42
512       22.5                  1.27
1024      23.1                  0.63

[Figures: MPI and I/O profiling at 512 cores and at 1024 cores]

[Figure: I/O overhead on simulation time, increment % (0% to 300%) versus number of cores (64 to 1024); curves: Fermi, Lagrange]

Number of iterations : 100

Files per core : 3

MPI_Allreduce average message size per core (B) : 8 (at 1024 cores)

Average message size sent and received per core (KB) : 6.4 (at 1024 cores)


BoxTurb – 3D

Flow : compressible

Case study : homogeneous, isotropic turbulence

Mesh : uniform 3D

Number of cells : ≈ 17,000,000

Solver : sonicFoam

Partition method : simple

Courtesy of : Matteo Cerminara (INGV), Pisa


BoxTurb – 3D Speed up and efficiency

Solution saved at the final time step

[Figures: efficiency and speed-up versus number of cores (64 to 2048), partition method simple; curves: Fermi, Lagrange, Ideal]


BoxTurb – 3D Speed up and efficiency

Solution saved every 10 time steps

[Figures: efficiency and speed-up versus number of cores (64 to 1024), partition method simple; curves: Fermi, Lagrange, Ideal]


BoxTurb – 3D – Profiling

[Figure: I/O overhead on simulation time, increment % (0% to 140%) versus number of cores (64 to 1024); curves: Fermi, Lagrange]

# Cores   Cumulative I/O (GB)   File size per core (MB)
64        18.4                  4.50
128       18.4                  2.25
256       18.6                  1.14
512       19.6                  0.60
1024      21.2                  0.32

[Figures: MPI and I/O profiling at 512 cores and at 1024 cores]

Number of iterations : 180

Files per core : 4

MPI_Allreduce average message size per core (B) : 8 (at 1024 cores)

Average message size sent and received per core (KB) : 9.3 (at 1024 cores)


Airfoil – wing section

Flow : turbulent, incompressible

Case study : steady state, extruded NACA airfoil

Mesh : fully structured 3D

Number of cells : ≈ 9,000,000

Solver : simpleFoam

Method : simple - scotch


Airfoil – wing section - Speed up and efficiency

Solution saved at the final time step

[Figures: efficiency and speed-up versus number of cores (64 to 1024), partition methods simple and scotch; curves: Fermi, Lagrange, Ideal]


Airfoil – wing section – Profiling

[Figures: four MPI profiling panels at 512 cores, two for partition method simple and two for partition method scotch]


Airfoil – wing section - Speed up and efficiency

Solution saved every 100 time steps

[Figures: efficiency and speed-up versus number of cores (64 to 1024), partition methods simple and scotch; curves: Fermi, Lagrange, Ideal]


Airfoil – wing section – Profiling

[Figures: MPI and I/O profiling at 512 cores and at 1024 cores]

# Cores   Cumulative I/O (GB)   File size per core (MB)
64        5.6                   1.46
128       5.8                   0.76
256       6.6                   0.43
512       7.9                   0.26
1024      12.0                  0.20

Number of iterations : 1000

Files per core : 6

MPI_Allreduce average message size per core (B) : 8 (at 512 cores)

Average message size sent and received per core (KB) : 4.2 (at 512 cores)

[Figure: decomposition method scotch, percentage scale (0% to 80%) versus number of cores (64 to 1024), axis labelled "Speed up"; curves: Fermi, Lagrange]


Free surface - dtmb hull – 3D

Flow : turbulent, incompressible

Case study : unsteady, multiphase

Mesh : unstructured 3D

Number of cells : ≈ 5,500,000

Solver : interFoam

Method : simple - scotch


Free surface - dtmb hull – 3D Speed up and efficiency

Solution saved at the final time step

[Figures: efficiency and speed-up versus number of cores (32 to 512), partition methods simple and scotch; curves: Fermi, Lagrange, Ideal]


Free surface, dtmb hull – 3D Speed up and efficiency

Solution saved every 10 time steps

[Figures: efficiency and speed-up versus number of cores (32 to 512), partition methods simple and scotch; curves: Fermi, Lagrange, Ideal]


Free surface - dtmb hull – 3D - Profiling

# Cores   Cumulative I/O (GB)   File size per core (MB)
64        18.4                  4.50
128       18.4                  2.25
256       18.6                  1.14
512       19.6                  0.60

Number of iterations : 100

Files per core : 8

MPI_Allreduce average message size per core (B) : 8 (at 512 cores)

Average message size sent and received per core (KB) : 29.4 (at 512 cores)

[Figure: I/O overhead on simulation time, increment % (0% to 100%) versus number of cores (32 to 512); curves: Fermi, Lagrange]

[Figures: MPI and I/O profiling at 256 cores and at 512 cores]


Conclusions

OpenFOAM scaling and efficiency on Fermi and on classic HPC systems are comparable. For well-suited case studies, with a good balance between computation, I/O and MPI communication, users can benefit from the larger number of cores available on Fermi.

OpenFOAM efficiency and scaling are constrained by poor I/O design and by inter-process communication.

A new I/O scheme based on MPI parallel I/O routines or on available parallel I/O libraries, able to use parallel file system facilities efficiently, should dramatically reduce the I/O overhead.

A multi-threaded hybrid MPI/OpenMP version of the solvers should mitigate the growth of the time spent in MPI routines as the number of cores increases.
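One way to see why a hybrid version could help is to count the MPI ranks needed for a fixed number of cores. The sketch below is plain arithmetic, not a performance model. The thread counts are illustrative choices (16 matches the application cores of a Fermi node), and the function name is hypothetical.

```python
# Number of MPI ranks for a fixed core count under a hybrid MPI/OpenMP layout.
# The thread counts below are illustrative choices, not settings from the deck.

def mpi_ranks(n_cores, threads_per_rank):
    """MPI ranks needed when each rank drives `threads_per_rank` cores."""
    if n_cores % threads_per_rank != 0:
        raise ValueError("core count must be a multiple of threads per rank")
    return n_cores // threads_per_rank

for threads in (1, 4, 16):
    print(threads, mpi_ranks(1024, threads))
```

On 1024 cores, one rank per node with 16 threads leaves 64 ranks instead of 1024. That means fewer participants in collectives such as the 8-byte MPI_Allreduce reported in the profiling slides, and fewer partition boundaries to exchange.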


Acknowledgements

Bob Danani VLSCI Carlton, Melbourne

Matteo Cerminara INGV

Massimiliano Culpo CINECA

Piero Lanucara CINECA

Andrea Penza CINECA

Francesco Salvadore CINECA

Ivan Spisso CINECA


Questions ?