
Aerodynamics of a high-performance vehicle: a parallel computing application inside the Hi-ZEV project

Workshop “HPC enabling of OpenFOAM® for CFD applications”

26-28 November, CINECA, Casalecchio di Reno (BO), Italy

A. De Maio(1), V. Krastev(2), P. Lanucara(3), F. Salvadore(3)

(1) Nu.m.i.d.i.a. S.r.l.

(2) Dept. of Industrial Engineering, University of Rome “Tor Vergata”

(3) CINECA Roma, Dipartimento SCAI

Summary

• Hi-ZEV project outline

• Preliminary evaluation of the OpenFOAM® code

• Prototype car simulations: aerodynamic results and scalability/performance tests

• Conclusions



Hi-ZEV: a collaborative industrial research project

• Granted by the Italian Ministry of Economic Development’s program «Industria 2015 – Nuove Tecnologie per il Made in Italy»

• The aim of the project is the development of an Innovative High Performance Car with Low Environmental Impact based on an Electrical/Hybrid Powertrain

• The project started on 01/01/2011 and will last until 31/12/2013

Hi-ZEV: the partners


Technos Reat

Fondazione Italiana Nuove Comunicazioni

Icomet Microsistemi srl Elettromedia Advanced Devices spa

Dyesol Italia srl Leaff Engineering srl ISAM spa Concept Inn srl HPH Consulting

http://picchio.com/

http://www2.fci.unibo.it/loghi-link-paginaweb-Faenza/enea.jpg

http://www.bylogix.it/index.php?l=it



Technos Reat




Team Leader and Project Coordinator










Hi-ZEV: technical Key Points

Very light vehicle (low weight/power ratio)

High-performance Hybrid Powertrain for wide-range torque availability

Very advanced chassis and suspensions for excellent road-holding

Accurate Fluid-Dynamic Design (CFD)

The role of CFD inside the project


• In the early, as well as in the more advanced design stages, CFD can be effectively used to optimize:

1. the external aerodynamics of the vehicle;

2. the underhood aerodynamics/thermal management;

3. the HVAC systems.

• The combination of an open-source, fully parallelized code (OpenFOAM®) with the HPC infrastructure of CASPUR/CINECA represents a powerful and efficient answer to these needs.

[Diagram: OpenFOAM® + HPC supporting CFD for external aerodynamics, underhood and HVAC]

Preliminary simulations on the Matrix cluster


• Preliminary evaluation of OpenFOAM® on the Matrix infrastructure

• Standard external aerodynamics test case (Ahmed body)

• OpenFOAM-1.7.1 + OpenMPI-1.4.2 + Scotch for decomposition

• Steady-state solver (simpleFoam) on unstructured grids (up to 6×10⁶ cells)

• High-Re RANS turbulence modeling (RNG/realizable k-ε + wall functions, WF)

• Up to 256 cores (32 nodes) involved

Matrix cluster:

• 8 cores per node (2 × quad-core AMD Opteron 23xx @ 2.1 GHz)

• 320 nodes with 16 GB RAM each

• InfiniBand DDR connection between nodes

• 20 Tflops peak performance, 177 Mflops/W sustained performance

Preliminary simulations on the Matrix cluster: computational domain


Ahmed body results: wake flow structures, ϕ=25°


[Figure: wake flow structures in the symmetry plane and in 3D (Q-criterion, Q = 10⁴ s⁻²), realizable k-ε (RKE) vs. RNG k-ε (RNG)]

Ahmed body results: velocity profiles in the symmetry plane

[Figure: velocity profiles in the symmetry plane; panels: ϕ=25°, ϕ=35°]

Ahmed body results: integrated rear pressure drag

Overall comparison:

Rear pressure drag coefficients (ϕ = 25°)

                   Slant    Base     Total    Difference (%)*
RKE                0.147    0.088    0.235    -13.3
RNG                0.147    0.083    0.230    -15.1
Lienhart et al.    0.156    0.115    0.271    -

Rear pressure drag coefficients (ϕ = 35°)

                   Slant    Base     Total    Difference (%)*
RKE                0.110    0.107    0.217    -12.5
RNG                0.115    0.101    0.216    -12.9
Lienhart et al.    0.121    0.127    0.248    -

* Difference of the total coefficient relative to the Lienhart et al. value (consistent with the tabulated totals).

Comments:

Results are aligned with previous CFD studies on the 25°/35° configurations

The realizable k-ε model captures fairly well the relative drag reduction (~8%) going from the 25° to the 35° configuration
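The totals and percentage differences in the rear pressure drag tables can be checked with a few lines of Python. This is only a sketch: it assumes that the "Difference (%)" column is the relative deviation of each total from the Lienhart et al. total, which matches the tabulated numbers; the dictionary layout and function names are illustrative.

```python
# Rear pressure drag coefficients from the slide tables: (slant, base).
# Assumption: the "Lienhart et al." rows are the reference values.
DRAG = {
    25: {"RKE": (0.147, 0.088), "RNG": (0.147, 0.083),
         "Lienhart et al.": (0.156, 0.115)},
    35: {"RKE": (0.110, 0.107), "RNG": (0.115, 0.101),
         "Lienhart et al.": (0.121, 0.127)},
}


def total(slant_angle, source):
    """Total rear pressure drag coefficient (slant + base)."""
    slant, base = DRAG[slant_angle][source]
    return slant + base


def difference_percent(slant_angle, model, reference="Lienhart et al."):
    """Relative deviation of a model total from the reference total, in %."""
    ref = total(slant_angle, reference)
    return 100.0 * (total(slant_angle, model) - ref) / ref


def drag_change_percent(source):
    """Relative change of the total drag going from 25 deg to 35 deg, in %."""
    return 100.0 * (total(35, source) - total(25, source)) / total(25, source)


for angle in (25, 35):
    for model in ("RKE", "RNG"):
        print(angle, model,
              round(total(angle, model), 3),
              round(difference_percent(angle, model), 1))

for source in ("RKE", "RNG", "Lienhart et al."):
    print(source, round(drag_change_percent(source), 1))
```

The last loop prints the 25° to 35° drag change for each row, which is the quantity the "~8%" comment refers to.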

Ahmed body results: some considerations about scalability

Case description:

• Finest grid (~6×10⁶ cells)

• PCG linear solver on pressure equation

• 64-96-128-256 cores (8-12-16-32 nodes) progression

• Almost linear inter-node scaling (at least in the considered interval)

Speedup specific efficiency:

$$\text{s.e.} = \frac{\text{speedup relative increase}}{\text{nodes relative increase}}$$

[Figure: speedup specific efficiency (%) vs. nodes increase (8-12, 12-16, 16-32); vertical axis from 88% to 102%]
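A minimal sketch of how the speedup specific efficiency could be evaluated from measured timings. The slide only gives the ratio "speedup relative increase / nodes relative increase"; the reading used here, (S_b/S_a - 1) / (N_b/N_a - 1) with speedup inversely proportional to time per step, is an assumption, and the timing values are hypothetical.

```python
def speedup_specific_efficiency(nodes_a, time_a, nodes_b, time_b):
    """Speedup specific efficiency (%) between two runs a -> b.

    Assumption: speedup is inversely proportional to the time per step,
    so the speedup ratio between the two runs is time_a / time_b.
    """
    speedup_relative_increase = time_a / time_b - 1.0
    nodes_relative_increase = nodes_b / nodes_a - 1.0
    return 100.0 * speedup_relative_increase / nodes_relative_increase


# Hypothetical time-per-step values (s), NOT measurements from the slides.
timings = {8: 12.0, 12: 8.2, 16: 6.3, 32: 3.3}

nodes = sorted(timings)
for a, b in zip(nodes, nodes[1:]):
    se = speedup_specific_efficiency(a, timings[a], b, timings[b])
    print(f"{a}-{b} nodes: {se:.1f}%")
```

With this definition, a value of 100% means that the speedup grew exactly as fast as the node count over the interval.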

Prototype car simulations


• Aims:

1. Aerodynamic optimization of the Hi-ZEV prototype external design;

2. More systematic scalability tests on the CASPUR/CINECA HPC infrastructures.

• Two hybrid (prisms + tetras) grids considered:

1. 7.5×10⁶ cells (symmetric);

2. 15×10⁶ cells (complete geometry).

• OpenFOAM-2.1.1 + Scotch

• Three architectures selected for the performance tests:

Matrix (AMD Opteron)

• 8 cores per node (2 × quad-core AMD Opteron 23xx @ 2.1 GHz)

• 320 nodes with 16 GB RAM each

• InfiniBand DDR connection between nodes

• 20 Tflops peak performance, 177 Mflops/W

Jazz (Intel Xeon)

• 12 cores per node (2 × six-core Intel X5650 “Westmere” @ 2.67 GHz)†

• 16 nodes with 48 GB RAM each

• InfiniBand QDR connection between nodes

• 14.3 Tflops peak performance, 785 Mflops/W

† Each node is also equipped with 2 NVIDIA Tesla GPU computing units, not involved in the OpenFOAM simulations

Fermi (BG/Q)

• 16 cores per node (IBM PPC A2 @ 1.6 GHz)

• 10240 nodes (163840 cores) with 16 GB RAM each (1 GB per core)

• Network interface with 11 links -> 5D torus

• 2 Pflops peak performance

Prototype car simulations: computational domain

[Figure: computational domain around the half car; boundaries: inlet, outlet, top, side, symmetry plane, moving floor]

Prototype car simulations: aerodynamic results (OF vs. Fluent)

OpenFOAM® settings:

• Symmetrical prism/tetra grid (exactly the same for both codes)

• simpleFoam pressure-based solver

• Realizable k-ε for turbulence + standard WF

• TVD scheme for momentum convection, upwind for k/ε

Fluent settings:

• Symmetrical prism/tetra grid (exactly the same for both codes)

• Pressure-based solver

• Realizable k-ε for turbulence + non-equilibrium WF

• Second-order upwind scheme for all convective terms

Aerodynamic coefficients:

             Cd      CL
OpenFOAM®    0.32    0.14
Fluent       0.31    0.17



Pressure distribution around the car

$$C_p = \frac{p - p_\infty}{\tfrac{1}{2}\,\rho_\infty U_\infty^2}$$

[Figures: pressure coefficient distribution, Fluent (6000 iterations) vs. OpenFOAM (4500 iterations). Sections: y=0 (symmetry plane), y=-0.4, y=-0.7]

Total pressure distribution around the car

$$C_{p_t} = \frac{p_t - p_\infty}{p_{t,\infty} - p_\infty}$$

[Figures: total pressure coefficient distribution, Fluent vs. OpenFOAM. Sections: y=0 (symmetry plane), y=-0.4, y=-0.7, z=0.11]
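The two coefficients plotted in the pressure and total pressure maps, C_p = (p − p_∞) / (½ ρ_∞ U_∞²) and C_pt = (p_t − p_∞) / (p_t,∞ − p_∞), can be sketched in Python as below. The free-stream values are hypothetical and only serve to show the expected limits; the incompressible relation p_t,∞ = p_∞ + ½ ρ_∞ U_∞² is an assumption made for the example.

```python
def pressure_coefficient(p, p_inf, rho_inf, u_inf):
    """C_p = (p - p_inf) / (0.5 * rho_inf * u_inf**2)."""
    return (p - p_inf) / (0.5 * rho_inf * u_inf ** 2)


def total_pressure_coefficient(p_t, p_inf, p_t_inf):
    """C_pt = (p_t - p_inf) / (p_t_inf - p_inf)."""
    return (p_t - p_inf) / (p_t_inf - p_inf)


# Hypothetical free-stream conditions (not taken from the slides).
rho_inf = 1.2   # kg/m^3
u_inf = 40.0    # m/s
p_inf = 0.0     # Pa, gauge pressure

# Assumption: incompressible flow, so the free-stream total pressure is
# the static pressure plus the dynamic pressure.
q_inf = 0.5 * rho_inf * u_inf ** 2
p_t_inf = p_inf + q_inf

# Stagnation point: static pressure equals free-stream total pressure.
print(pressure_coefficient(p_t_inf, p_inf, rho_inf, u_inf))   # close to 1
# Undisturbed flow: no total pressure loss.
print(total_pressure_coefficient(p_t_inf, p_inf, p_t_inf))    # close to 1
# Wake region with total pressure reduced to the free-stream static level.
print(total_pressure_coefficient(p_inf, p_inf, p_t_inf))      # 0
```

Values of C_pt below 1 mark regions with total pressure losses, such as the wake and the separated zones.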

Prototype car simulations: inter-node scalability tests (Matrix vs. Jazz)

Case description:

• Symmetrical grid (~7.5×10⁶ cells)

• PCG and GAMG linear solvers on pressure equation

• 50 iterations monitored, starting from a fairly converged solution

• The computing node is selected as the fundamental unit

$$\text{speedup} = \frac{(\text{time per step})_{1\ \text{node}}}{(\text{time per step})_{N\ \text{nodes}}}$$

[Figures: speedup vs. number of nodes. Panels: Matrix vs. Jazz, PCG; Matrix vs. Jazz, GAMG; Matrix, GAMG vs. PCG; Jazz, GAMG vs. PCG]

Comments:

The PCG solver clearly outperforms GAMG when the parallelization starts to become extensive (approximately above 100 processes for the half-car case)

Jazz appears to scale better than Matrix, probably because of the more capable InfiniBand network (QDR vs. DDR) and of better cache “filling” as the per-process workload becomes smaller
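The node-based speedup used in these tests, i.e. the time per step on one node divided by the time per step on N nodes, can be sketched as follows. The timings in the example are hypothetical placeholders, not the measured Matrix or Jazz values.

```python
def speedup(time_per_step_by_nodes):
    """Node-based speedup: t(1 node) / t(N nodes) for every N.

    `time_per_step_by_nodes` maps the number of nodes to the average
    time per step (s) and must contain the single-node timing.
    """
    t_single = time_per_step_by_nodes[1]
    return {n: t_single / t
            for n, t in sorted(time_per_step_by_nodes.items())}


# Hypothetical time-per-step values (s) averaged over the monitored
# iterations; NOT measurements from the slides.
timings = {1: 40.0, 2: 20.0, 4: 12.5, 8: 7.0}

for n, s in speedup(timings).items():
    print(f"{n:3d} nodes: speedup = {s:.2f}")
```

Taking the node as the fundamental unit keeps intra-node effects (memory bandwidth, cache sharing) out of the inter-node comparison.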

Prototype car simulations: absolute and single-node performances (Matrix vs. Jazz)

Case description:

• Time-per-step evaluated on a per-core basis

[Figures: time per step (s) vs. number of cores. Panels: Matrix, GAMG vs. PCG (8 to 256 cores); Jazz, GAMG vs. PCG (12 to 192 cores); single-node Matrix, GAMG vs. PCG (1 to 8 cores); single-node Jazz, GAMG vs. PCG (1 to 12 cores)]

Comments:

Despite the very inefficient intra-node scaling, the newer Intel architecture is (as expected) much faster than the AMD one

If the number of processes is kept within the “acceptable scaling range”, the GAMG solver is always faster than the PCG one (e.g. 40% faster on 64 Matrix cores)


Prototype car simulations: scalability tests (Fermi, symmetrical grid)

Case description:

• 16 and 32 MPI processes per node (ppn) considered

$$\text{s.e.}(\%) = 100 \cdot \frac{1}{N} \cdot \frac{(\text{time per step})_{1\ \text{node}}}{(\text{time per step})_{N\ \text{nodes}}}$$

[Figures: speedup efficiency (%) vs. number of nodes. Panels: 16 ppn, PCG vs. GAMG (2 to 256 nodes); PCG, 16 ppn vs. 32 ppn (2 to 64 nodes)]
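A short sketch of the speedup efficiency used for the Fermi runs, s.e.(%) = 100 · t(1 node) / (N · t(N nodes)). The optional `reference_nodes` argument is an addition for cases where no single-node timing is available; with its default value of 1 the function reduces to the formula on the slide. All timings are hypothetical.

```python
def speedup_efficiency_percent(time_per_step_by_nodes, reference_nodes=1):
    """Speedup efficiency (%) relative to the run on `reference_nodes`.

    With reference_nodes=1 this is 100 * t(1) / (N * t(N)).
    A different reference is an assumption of this sketch, not part of
    the definition given on the slide.
    """
    t_ref = time_per_step_by_nodes[reference_nodes]
    return {
        n: 100.0 * (reference_nodes * t_ref) / (n * t)
        for n, t in sorted(time_per_step_by_nodes.items())
    }


# Hypothetical time-per-step values (s); NOT measurements from the slides.
timings = {1: 40.0, 2: 20.0, 4: 12.5, 8: 7.0}

for n, eff in speedup_efficiency_percent(timings).items():
    print(f"{n:3d} nodes: {eff:.1f}%")
```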

What about absolute performance?

[Figure: time per step (s) vs. number of nodes (2 to 64), PCG, 16 ppn vs. 32 ppn]

Using more ppn could apparently be beneficial in terms of absolute performance, but when the number of nodes reaches a “practical” value (64) the benefit vanishes, and in addition…


Prototype car simulations: I/O performance tests (Fermi, symmetrical grid)

Case description:

• PCG linear solver on pressure equation

• Output generation time and initialization time monitored

• 16 and 32 MPI processes per node considered

[Figures: time (s) vs. number of nodes (4 to 128), PCG, 16 ppn vs. 32 ppn. Panels: output generation time; initialization time]


Prototype car simulations: comments about Fermi runs (symmetrical grid)

Comments:

The case is of course too small to prove Fermi’s real potential, but…

…up to the minimum “practical” node number (64) the SIMPLE iteration scaling is acceptable (PCG)

…when the I/O capability of the nodes actually gets saturated, a dramatic drop in I/O efficiency occurs (and things get even worse with 32 ppn)






Further simulations on Fermi: doubled grid

Case description:

• Doubled grid (~15×10⁶ cells)

• PCG solver on pressure equation

• Only 16 ppn considered

• Comparison made assuming the same mesh-per-node load distribution (i.e. doubling the number of nodes for the bigger grid)

[Figures: time (s) vs. number of nodes (symm-double: 32-64, 64-128, 128-256), PCG, symmetrical vs. doubled grid. Panels: time per step; output generation time (O.g.t.); initialization time (I.t.)]

Comments:

The SIMPLE iteration weak-scaling performance appears fairly good and thus should encourage more tests on bigger cases, but…

…the I/O issues are confirmed
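The pairing used for the doubled-grid comparison (same mesh load per node, obtained by doubling the number of nodes for the bigger grid) can be verified with simple arithmetic. The sketch below uses the nominal cell counts and node pairs from the slides; the weak-scaling efficiency function is an illustrative addition, not a metric defined in the presentation, and the timings passed to it are hypothetical.

```python
SYMM_CELLS = 7.5e6      # symmetrical grid (nominal size)
DOUBLE_CELLS = 15e6     # doubled grid (nominal size)
PPN = 16                # MPI processes per node used in these runs

# (nodes for the symmetrical grid, nodes for the doubled grid)
NODE_PAIRS = [(32, 64), (64, 128), (128, 256)]


def cells_per_node(cells, nodes):
    """Average mesh load per node."""
    return cells / nodes


def weak_scaling_efficiency_percent(time_symm, time_double):
    """Illustrative metric (assumption): 100% means the doubled grid on
    twice the nodes takes the same time per step as the symmetrical grid."""
    return 100.0 * time_symm / time_double


for n_symm, n_double in NODE_PAIRS:
    load_symm = cells_per_node(SYMM_CELLS, n_symm)
    load_double = cells_per_node(DOUBLE_CELLS, n_double)
    print(f"{n_symm}-{n_double} nodes: "
          f"{load_symm:.0f} vs {load_double:.0f} cells per node, "
          f"about {load_symm / PPN:.0f} cells per process")

# Hypothetical time-per-step values (s), NOT taken from the slides.
print(weak_scaling_efficiency_percent(2.0, 2.5))
```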






Conclusions (1)

• Hi-ZEV is a successful example of how industry can take advantage of the combination of parallelized open-source CFD toolkits and highly qualified HPC infrastructures, in a collaborative project framework

• The OpenFOAM® code has been evaluated on “conventional” AMD and Intel HPC facilities for external aerodynamics applications, showing:

– Good accuracy compared to well-established commercial CFD codes;

– Interesting parallel performance (still not fully exploited), at least for small/medium size cases (~10⁷ cells) and depending on the optimal pressure solver choice (PCG scales better, GAMG is faster for small numbers of processes)


Conclusions (2)

• The OpenFOAM® performance has also been assessed on the BG/Q supercomputer Fermi and, in spite of the (relatively) small size of the considered cases, the following remarks can be made:

– The solver iteration scaling performance is promising (with PCG), especially in the perspective of coping with much bigger problems;

– Though for the considered cases a more conventional architecture (e.g. Intel Xeon) seems to be a better choice, a deeper investigation should be made in order to also include performance vs. energy consumption aspects;

– Unfortunately, for massively parallel applications (thousands of processes) a serious I/O efficiency issue arises (further evaluation needed)


Acknowledgments



M. Testa(1) (for providing the half-car grid and Fluent results)

