Upload
lydat
View
218
Download
1
Embed Size (px)
Citation preview
HPC Benchmarking and
Performance Evaluation with
Real-Life Engineering Problem
using ANSYS 12.1
AMIR KHAN
Lecturer
Institute of Space Technology, Islamabad
13-NOV-2014
Sequence of Presentation
Introduction
HPC Benchmarking
Platform Description
Softwares used
Discussion on results
Conclusion
13/11/2014 2
Introduction
The purpose of benchmarking and performance evaluation
is to assess the performance and understand the
characteristics of HPC platforms
The performance could be assessed using LINPACK
(LINear algebra PACKage) which is used to solve many
millions of linear algebraic equations or by solving a real-
life problem using a commercial software/code
The main goal of such practice is the search for machines
those are the best suitable for industrial/research projects
A good practice is to evaluate the performance of the
computer cluster architecture, before its purchase, using
the same application software which a company uses for
their design
13/11/2014 3
HPC Benchmarking
HPC
High-performance computing (HPC) is the use of
parallel processing for running advanced application
programs efficiently, reliably and quickly
HPC benchmarking
Benchmarking is used to measure performance using a
specific indicator
13/11/2014 4
SHARED MEMORY PROCESSING (SMP) Usually same PC with multi-core and
multi-processors
Scalable up to about 64 or 128
processors but costly
Synchronization problem
DISTRIBUTED MEMORY PROCESSING (MPP)
Distributed among physically
different machines of same architecture
Scalable up to larger no. of Processors
Message passing overheads
Types of HPC Architectures
Network
CPU 1 CPU 2 CPU 3
memory memory memory
Symmetric Multi Processor
Network
CPU 1 CPU 2 CPU 3
memory memory memory
Distributed Memory Processing
Platform Description
13/11/2014 6
Head Node (01 node)
CPU Type AMD Opteron 6174, 12-core (Magny-cours, 2.2 GHz)
CPU numbers per node 2
Memory 64
Storage 1 TB
Compute Nodes (10 nodes)
CPU Type
Intel Xeon Westmere X5670 (2.93 GHz clockspeed, Hex-Core)
CPU numbers per node 2
Total cores 10 x 2 x 6 = 120
Memory 48 GB per node
Hard disk 300 GB per node
13/11/2014 7
Input file for
Calculates: u2
Input file for
Calculates: u1
Calculates:
P1 P2
A
u1
u2
u and K ,f ,f ,K,K2121
1 2
K2, f 2
K1, f 1
Sequence Diagram
Head Node
Compute
Node 1
Compute
Node 2
fuK
Platform Description
13/11/2014 8
Cluster Configuration
System interconnect Mellanox 5200 QDR Infiniband
Storage system Panasas 12 file server storage type
Total RAM 480GB
Total cores 120
Cluster TB1.1
Softwares used
13/11/2014 9
Benchmarking Software ANSYS 12.1
Solver Direct solver (dsparse)
Operating System RHEL5
Job scheduler PBS
Monitoring software Ganglia
DISCUSSION ON RESULTS
13/11/2014 10
Problem Description
A static structural analysis of a square bar was
carried out. The problem contains 8 millions DOFs.
One end of the bar is fully constrained while the
other end is subjected to the tensile loading. Number
of cores varies from 24 to 120 cores.
13/11/2014 11
Elapsed Time
13/11/2014 12
162.3
148.1
98.2
74.5
60.4 56.4
0
20
40
60
80
100
120
140
160
180
24 36 48 72 96 120
Time vs cores
Ela
pse
d T
ime
(M
inu
tes)
No. of Cores
Time curve starts getting steady/horizontal from
120 cores shows that further increase of number of
cores will not considerably decrease the
computational time, rather it might increase
Computational Time save (Speedup)
13/11/2014 13
0.00
8.70
39.49
54.11
62.79 65.24
0
10
20
30
40
50
60
70
24 36 48 72 96 120
speedup
No. of Cores
P
erc
en
tage
Tim
e s
ave
Cluster Performance
13/11/2014 14
100.1 100.7
164.6
236.0
338.2
406.9
0
50
100
150
200
250
300
350
400
450
24 36 48 72 96 120
Computational rate G
Flo
ps
No. of Cores
29 % of Peak performance at 120 cores
Problem Description
A steady state Thermal analysis problem was
analyzed
A bar with 2.4 MDOF was subjected to a constant
temperature at one end as shown
Number of cores varying from 6 to 30 cores
13/11/2014 15
Elapsed Time
13/11/2014 16
188
166
138
105
0
20
40
60
80
100
120
140
160
180
200
6 12 24 30
Time vs cores
No. of Cores
Elap
sed
Tim
e (s
eco
nd
s)
Computational Time save (Speedup)
13/11/2014 17
0.00
11.50
26.51
44.09
0
10
20
30
40
50
6 12 24 30
speedup
Perc
enta
ge (
%)
No. of Cores
Cluster Performance
13/11/2014 18
28.98
53.01
98.50
123.08
0
20
40
60
80
100
120
140
6 12 24 30
Computational rate GFl
op
s
No. of Cores
35 % of Peak performance at 30 cores
Conclusion
Computational time save of 65.24% could be
achieved while using 120 cores of TB1.1 blades
comparing to 24 cores while dealing with 8MDOF
problem
Computational time save of 44.09% could be
achieved using 30 cores comparing to cores while
handling a problem involving 2.4 MDOF
The results could be used to compare the
performance of the other HPC systems blades to
select the appropriate HPC system for an industry.
13/11/2014 19
Thanks for your attention
13/11/2014 20
Supplements
Peak Performance of a cluster 120x2.93x4 = 1.4 Tflops
Speedup = (t2-t1)/t1
13/11/2014 21