HPC Benchmarking and Performance Evaluation with Real-Life

HPC Benchmarking and

Performance Evaluation with

Real-Life Engineering Problem

using ANSYS 12.1

AMIR KHAN

Lecturer

Institute of Space Technology, Islamabad

13-NOV-2014

Sequence of Presentation

Introduction

HPC Benchmarking

Platform Description

Softwares used

Discussion on results

Conclusion

13/11/2014 2

Introduction

The purpose of benchmarking and performance evaluation

is to assess the performance and understand the

characteristics of HPC platforms

The performance could be assessed using LINPACK

(LINear algebra PACKage) which is used to solve many

millions of linear algebraic equations or by solving a real-

life problem using a commercial software/code

The main goal of such practice is the search for machines

those are the best suitable for industrial/research projects

A good practice is to evaluate the performance of the

computer cluster architecture, before its purchase, using

the same application software which a company uses for

their design

13/11/2014 3

HPC Benchmarking

HPC

High-performance computing (HPC) is the use of

parallel processing for running advanced application

programs efficiently, reliably and quickly

HPC benchmarking

Benchmarking is used to measure performance using a

specific indicator

13/11/2014 4

SHARED MEMORY PROCESSING (SMP) Usually same PC with multi-core and

multi-processors

Scalable up to about 64 or 128

processors but costly

Synchronization problem

DISTRIBUTED MEMORY PROCESSING (MPP)

Distributed among physically

different machines of same architecture

Scalable up to larger no. of Processors

Message passing overheads

Types of HPC Architectures

Network

CPU 1 CPU 2 CPU 3

memory memory memory

Symmetric Multi Processor

Network

CPU 1 CPU 2 CPU 3

memory memory memory

Distributed Memory Processing


13/11/2014 6

Head Node (01 node)

CPU Type AMD Opteron 6174, 12-core (Magny-cours, 2.2 GHz)

CPU numbers per node 2

Memory 64

Storage 1 TB

Compute Nodes (10 nodes)

CPU Type

Intel Xeon Westmere X5670 (2.93 GHz clockspeed, Hex-Core)

CPU numbers per node 2

Total cores 10 x 2 x 6 = 120

Memory 48 GB per node

Hard disk 300 GB per node

13/11/2014 7

Input file for

Calculates: u2

Input file for

Calculates: u1

Calculates:

P1 P2

A

u1

u2

u and K ,f ,f ,K,K2121

1 2

K2, f 2

K1, f 1

Sequence Diagram

Head Node

Compute

Node 1

Compute

Node 2

fuK


13/11/2014 8

Cluster Configuration

System interconnect Mellanox 5200 QDR Infiniband

Storage system Panasas 12 file server storage type

Total RAM 480GB

Total cores 120

Cluster TB1.1

Softwares used

13/11/2014 9

Benchmarking Software ANSYS 12.1

Solver Direct solver (dsparse)

Operating System RHEL5

Job scheduler PBS

Monitoring software Ganglia

DISCUSSION ON RESULTS

13/11/2014 10

Problem Description

A static structural analysis of a square bar was

carried out. The problem contains 8 millions DOFs.

One end of the bar is fully constrained while the

other end is subjected to the tensile loading. Number

of cores varies from 24 to 120 cores.

13/11/2014 11

Elapsed Time

13/11/2014 12

162.3

148.1

98.2

74.5

60.4 56.4

0

20

40

60

80

100

120

140

160

180

24 36 48 72 96 120

Time vs cores

Ela

pse

d T

ime

(M

inu

tes)

No. of Cores

Time curve starts getting steady/horizontal from

120 cores shows that further increase of number of

cores will not considerably decrease the

computational time, rather it might increase

Computational Time save (Speedup)

13/11/2014 13

0.00

8.70

39.49

54.11

62.79 65.24

0

10

20

30

40

50

60

70

24 36 48 72 96 120

speedup

No. of Cores

P

erc

en

tage

Tim

e s

ave

Cluster Performance

13/11/2014 14

100.1 100.7

164.6

236.0

338.2

406.9

0

50

100

150

200

250

300

350

400

450

24 36 48 72 96 120

Computational rate G

Flo

ps

No. of Cores

29 % of Peak performance at 120 cores

Problem Description

A steady state Thermal analysis problem was

analyzed

A bar with 2.4 MDOF was subjected to a constant

temperature at one end as shown

Number of cores varying from 6 to 30 cores

13/11/2014 15

Elapsed Time

13/11/2014 16

188

166

138

105

0

20

40

60

80

100

120

140

160

180

200

6 12 24 30

Time vs cores

No. of Cores

Elap

sed

Tim

e (s

eco

nd

s)

Computational Time save (Speedup)

13/11/2014 17

0.00

11.50

26.51

44.09

0

10

20

30

40

50

6 12 24 30

speedup

Perc

enta

ge (

%)

No. of Cores

Cluster Performance

13/11/2014 18

28.98

53.01

98.50

123.08

0

20

40

60

80

100

120

140

6 12 24 30

Computational rate GFl

op

s

No. of Cores

35 % of Peak performance at 30 cores

Conclusion

Computational time save of 65.24% could be

achieved while using 120 cores of TB1.1 blades

comparing to 24 cores while dealing with 8MDOF

problem

Computational time save of 44.09% could be

achieved using 30 cores comparing to cores while

handling a problem involving 2.4 MDOF

The results could be used to compare the

performance of the other HPC systems blades to

select the appropriate HPC system for an industry.

13/11/2014 19

Thanks for your attention

13/11/2014 20

Supplements

Peak Performance of a cluster 120x2.93x4 = 1.4 Tflops

Speedup = (t2-t1)/t1

13/11/2014 21

Documents

HPC Benchmarking and Performance Evaluation with Real-Life