ACCELERATING COMMERCIAL LINEAR DYNAMIC AND
NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH-
PERFORMANCE COMPUTING
Vladimir Belsky, Director of Solver Development*
Luis Crivelli, Director of Solver Development*
Matt Dunbar, Chief Architect*
Mikhail Belyi, Development Group Manager*
Michael Wood, Developer*
Cristian Ianculescu, Developer*
Mintae Kim, Developer*
Andrzej Bajer, Developer*
*Dassault Systèmes Simulia Corp.
Geraud Krawezik, Developer, Acceleware, Canada
ABSTRACT
In the last decade, significant R&D resources have been invested to deliver commercially available technologies that meet current and future mechanical engineering industry requirements, both in terms of mechanics and performance. While significant focus has been given to developing robust nonlinear finite element analysis technology, there has also been continued investment in advancing linear dynamic analyses. These research and development efforts have focused on combining advanced linear and nonlinear technology to provide accurate yet fast modelling of noise and vibration engineering problems. This effort has enabled high-fidelity models to run in reasonable time, which is vital for virtual prototyping within shortened product design cycles.
While it is true that model sizes (degrees of freedom) have grown significantly during this period, the complexity of the models has also increased, leading to a larger number of total iterations within nonlinear implicit analyses and to a large number of eigenmodes within linear dynamic simulations. An innovative approach has been developed to leverage high-performance computing (HPC) resources to yield reasonable turn-around times for such analyses by taking advantage of massive parallelism without sacrificing any mechanical formulation quality.
The accessibility and affordability of HPC hardware in the past few years has changed the landscape of commercial finite element analysis software usage and applications. This change has come in response to an expressed desire from engineers and designers to run their existing simulations faster or, in many cases, to run more realistic jobs. Due to their computational cost and the lack of high-performance commercial software, such "high-end" simulations were until recently thought to be available only to academic institutions or government research laboratories, which typically developed their own HPC applications. Today, with the advent of affordable multi-core SMP workstations and compute clusters with multi-core nodes, high-speed interconnects, and GPGPU accelerators, HPC is sought after by many engineers for routine FEA. This presents a challenge for commercial FEA software vendors, which must adapt their decades-old legacy code to take advantage of state-of-the-art HPC platforms.
Given this background, this paper focuses on how recent developments in HPC have affected the performance of linear dynamic and implicit nonlinear analyses. Two main HPC developments are studied. First, we examine the performance and scalability of the commercially available Abaqus AMS eigenvalue solver, and of the entire frequency response simulation, running on multi-core SMP workstations. Advances in the AMS eigenvalue solution procedure and linear dynamic capabilities make realistic simulation practical for a wide range of vehicle-level noise and vibration simulations.
Next, we discuss the progress made in a relatively new but very active area of high-performance commercial FE software development: taking advantage of GPGPU accelerators. Efficient adoption of GPGPUs in such products is a very challenging task that requires significant re-architecture of the existing code. We describe our experience integrating GPGPU acceleration into complex commercial engineering software; in particular, we discuss the trade-offs we made and the benefits we obtained from this technology.
KEYWORDS
HPC, Parallel Computing, Cluster Computing, Equation Solver, Non-linear Implicit FEA, GPGPU, Modal Linear Dynamics, AMS, Automated Multilevel Substructuring, Abaqus
1: AMS (Automatic Multilevel Substructuring) Eigensolver
As model meshes become more refined and accurate, the complexity of the models increases and the size of finite element models grows, all while the demand for faster job turn-around time continues to be strong. The role of a mode-based approach in linear dynamic analyses becomes crucial, given that the direct approach, based on solving a system of equations on the physical domain at each excitation frequency, becomes far more expensive as the size of finite element models grows. The most time-consuming task in mode-based linear dynamic analyses is the solution of a large eigenvalue extraction problem to create the modal basis. The most advanced eigenvalue extraction technology suitable for today's needs in automotive noise and vibration (N&V) simulation is AMLS. Beginning in 2006, SIMULIA began to offer a version of AMLS, marketed as Abaqus/AMS. The performance of the AMS eigensolver is therefore crucially important for reducing overall analysis runtime in large-scale N&V simulations.
Over the past three years (2007-2010), the Abaqus AMS eigensolver has evolved from its original serial implementation, designed for single-processor computers with limited memory and able to solve problems with a couple of million equations, into a modern implementation designed for multi-core processors and large memory, capable of solving problems with tens of millions of equations. Beginning with the Abaqus 6.10 Enhanced Functionality release, the AMS eigensolver can run in parallel on shared-memory computers with multiple processors, and its parallel performance has been improved substantially since that release.
To demonstrate the AMS eigensolver performance on HPC hardware, two automotive industrial models were run on a machine with four six-core Intel Xeon Nehalem processors and 128 GB of physical memory.
The first model, referred to as 'Model 1', is an automotive vehicle body model with 14.1 million degrees of freedom. This model has an acoustic cavity for coupled structural-acoustic frequency response analysis; the modal basis consists of 5190 structural modes and 266 acoustic modes below the maximum frequency of 600 Hz. The selective recovery capability for the structural domain, which recovers user-requested output variables at a user-defined node set, and the full recovery capability for the acoustic domain, which recovers user-requested output variables at all nodes of the model, are used in this simulation.
The second model, 'Model 2', is a powertrain model with 11.2 million degrees of freedom. The modal basis includes 377 modes below 2500 Hz, and the selective recovery capability is used.
The pre-release version of Abaqus 6.11 was used to obtain the performance data for both models.
Table 1 demonstrates the parallel performance of the AMS eigensolver for Model 1. In the table, 'FREQ' denotes the whole frequency extraction procedure, which includes the AMS eigensolver together with the non-scalable, non-solver parts of the code, while 'AMS' denotes the AMS eigensolver itself. The AMS eigensolver takes only 25 minutes to solve the eigenproblem on 16 cores, versus about 4 hours on a single core. The non-scalable parts become dominant as the number of cores increases. Figure 1 shows the scalability of the AMS eigensolver based on the data in Table 1. Thanks to the good parallel speedup of AMS, the frequency extraction procedure FREQ shows an overall speedup of about 5.
Table 1. Performance of the AMS eigensolver (AMS) and frequency extraction procedure (FREQ) for Model 1

Number of Cores | FREQ (6.11) Wall Clock Time (h:mm) | AMS (6.11) Wall Clock Time (h:mm)
 1 | 4:32 | 4:01
 4 | 1:38 | 1:07
 8 | 1:09 | 0:39
16 | 0:56 | 0:25
Figure 1. Scalability of the Abaqus 6.11 AMS eigensolver (AMS) and frequency extraction procedure (FREQ) for Model 1. [Chart: speedup factors on 4/8/16 cores; FREQ: 2.77, 3.93, 4.89; AMS: 3.58, 6.21, 9.61]
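The speedup factors plotted in Figure 1 follow directly from the wall-clock times in Table 1. As a quick sanity check, the short Python sketch below recomputes them; small differences from the plotted values arise because the table times are rounded to whole minutes.

```python
# Recompute the Figure 1 speedup factors from the Table 1 times.
def minutes(hmm):
    h, m = hmm.split(":")
    return int(h) * 60 + int(m)

freq = {1: "4:32", 4: "1:38", 8: "1:09", 16: "0:56"}  # FREQ (6.11)
ams  = {1: "4:01", 4: "1:07", 8: "0:39", 16: "0:25"}  # AMS (6.11)

for cores in (4, 8, 16):
    print(f"{cores:2d} cores: "
          f"FREQ x{minutes(freq[1]) / minutes(freq[cores]):.2f}, "
          f"AMS x{minutes(ams[1]) / minutes(ams[cores]):.2f}")
```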
Table 2 and Figure 2 show the parallel performance and scalability of the frequency extraction procedure (FREQ) and the AMS eigensolver (AMS) for Model 2. Due to the good scalability of the AMS eigensolver, the frequency extraction procedure takes only 36 minutes for this large model, which significantly reduces the overall job turn-around time.
Table 2. Performance of the AMS eigensolver (AMS) and frequency extraction procedure (FREQ) for Model 2

Number of Cores | FREQ (6.11) Wall Clock Time (h:mm) | AMS (6.11) Wall Clock Time (h:mm)
 1 | 2:57 | 2:33
 4 | 1:03 | 0:39
 8 | 0:45 | 0:21
16 | 0:36 | 0:13
Figure 2. Scalability of the Abaqus 6.11 AMS eigensolver (AMS) and frequency extraction procedure (FREQ) for Model 2. [Chart: speedup factors on 4/8/16 cores; FREQ: 2.79, 3.96, 4.86; AMS: 3.87, 7.45, 11.70]
2: Mode-based Frequency Response Analysis
Mode-based frequency response analysis is the commonly accepted method used by N&V engineers for the simulation of noise and vibration in vehicles and other structures. To reduce the cost of the analysis, the system of equations is solved in a modal subspace. The projection of the finite element system onto the modal subspace requires an eigenvalue extraction analysis, which in Abaqus is typically performed using the AMS eigensolver described in the previous section. The projected system of equations in the modal subspace takes the following form:

\begin{bmatrix} K - \omega^2 M & -(\omega C + D) \\ -(\omega C + D) & -(K - \omega^2 M) \end{bmatrix} \begin{Bmatrix} \operatorname{Re}(Q(\omega)) \\ \operatorname{Im}(Q(\omega)) \end{Bmatrix} = \begin{Bmatrix} \operatorname{Re}(F(\omega)) \\ -\operatorname{Im}(F(\omega)) \end{Bmatrix} \qquad (1)

Here K is the system stiffness matrix; M the mass matrix; C the viscous damping matrix; D the structural damping matrix; ω the excitation frequency; Q the generalized displacement; F the force vector; Re() the real part of a complex quantity; and Im() the imaginary part of a complex quantity.
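For readers who want the intervening step, equation (1) is what results from splitting the complex-valued modal system into real and imaginary parts; the brief derivation below is our reconstruction, consistent with the nomenclature above.

\begin{align*}
&\left(K + iD - \omega^2 M + i\omega C\right) Q(\omega) = F(\omega), \qquad Q = Q_r + iQ_i,\quad F = F_r + iF_i \\
&\text{Real part:} \quad (K - \omega^2 M)\, Q_r - (\omega C + D)\, Q_i = F_r \\
&\text{Imaginary part:} \quad (\omega C + D)\, Q_r + (K - \omega^2 M)\, Q_i = F_i
\end{align*}

Negating the imaginary-part equation makes the 2x2 block operator symmetric, which yields exactly the form (1).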
The size of the modal system (1) is twice the number of modes. If the frequency response is performed in the mid-frequency range, there are often more than 10,000 modes in a complex structure. If only diagonal damping is applied, the mode-based analysis is quite inexpensive, because the system of equations (1) becomes decoupled and every equation is solved separately. However, in the mid-frequency range modal damping alone is not sufficient, and material damping (e.g., dashpot elements and material structural damping) must be applied to obtain accurate results. The material damping causes the projected damping operators C and/or D in equation (1) to be fully populated. Thus, a system of linear equations whose size is twice the number of modes (2N) must be solved in the modal subspace at every frequency point. With a few hundred to a thousand frequency points, and the number of modes over 10,000, this becomes a rather expensive analysis.

Figure 3. The structure of the left-hand side operator for the mode-based frequency response analysis

In a typical case, when the stiffness matrix is symmetric and constant with respect to excitation frequency, the stiffness and mass operators are reduced to diagonal matrices in the modal subspace. The structure of the system of modal equations (1) in this case is presented in Figure 3. The diagonal blocks are diagonal matrices (corresponding to a linear combination of the projected mass and stiffness operators), while the off-diagonal blocks are fully populated (corresponding to the projected structural and viscous damping operators). Traditionally, this system of equations of size 2N is solved at every frequency. We instead first take advantage of the diagonal structure of part of the operator and reduce the size of the system by half; this reduction leaves a fully populated system of equations of size N. For the details and derivation of the reduction algorithm we refer to [1]. The reduction phase is dominated by matrix-matrix multiplication operations and takes more time than the subsequent solution of the reduced system. Thus, to obtain an efficient parallel algorithm, we need to parallelize both operations: the matrix-matrix multiplication and the factorization of the dense system of equations.
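To make the reduction concrete, the sketch below shows one way such an elimination can be written; it is an illustrative reconstruction based on the structure of equation (1), not the actual algorithm of [1]. In the modal subspace A = K - ω²M is diagonal (stored as a vector a) and B = ωC + D is dense; eliminating Re(Q) from the first block row leaves a dense N x N system for Im(Q), and the product B A⁻¹ B is the dominant O(N³) matrix-matrix multiplication cost.

```python
import numpy as np

def solve_reduced(a, B, f_re, f_im):
    """Solve eq. (1) at one frequency via reduction to size N.

    a    : (N,)  diagonal of A = K - omega^2*M (nonzero, i.e. omega
                 is not exactly a natural frequency)
    B    : (N,N) dense B = omega*C + D
    f_re, f_im : (N,) real and imaginary parts of F(omega)
    """
    Ainv_B = B / a[:, None]          # A^{-1} B: cheap row scaling
    S = np.diag(a) + B @ Ainv_B      # dense reduction, dominant O(N^3) cost
    rhs = f_im - B @ (f_re / a)
    q_im = np.linalg.solve(S, rhs)   # factorization of the dense N x N system
    q_re = (f_re + B @ q_im) / a     # recover Re(Q) from the first block row
    return q_re, q_im
```

In a full frequency sweep this reduction and dense solve are repeated at each of the several hundred frequency points, which is why both the multiplication and the factorization must be parallelized.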
The parallel algorithm for mode-based frequency response analysis is implemented on shared-memory machines. The computationally expensive ingredients of this algorithm, matrix-matrix products and dense linear solves, have been parallelized using a task-based approach, as sketched below. This implementation ensures that memory consumption remains constant regardless of the number of processors used, while achieving almost linear parallel scaling up to the number of general-purpose computational cores available on modern hardware.
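As an illustration of the task-based approach, the miniature below partitions a matrix-matrix product into independent column-block tasks executed by a thread pool. This is a hypothetical sketch (the production solver is not written in Python), but it shows the key property claimed above: the output buffer is preallocated once, so memory use does not grow with the number of threads.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def task_parallel_matmul(A, B, n_tasks=8):
    """Compute A @ B as independent column-block tasks."""
    out = np.empty((A.shape[0], B.shape[1]))   # single shared result buffer
    blocks = np.array_split(np.arange(B.shape[1]), n_tasks)

    def task(cols):
        # Each task writes a disjoint column block; NumPy releases the
        # GIL inside the underlying BLAS call, so tasks run concurrently.
        out[:, cols] = A @ B[:, cols]

    with ThreadPoolExecutor() as pool:
        list(pool.map(task, blocks))
    return out
```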
To demonstrate the effectiveness of this algorithm, we present a typical N&V analysis of the structural vibration of a car body. The stiffness matrix is symmetric and the model includes some structural damping, so the projected system looks like the one illustrated in Figure 3. Over 10,000 modes were extracted using the Abaqus/AMS eigensolver, and the analysis was performed at 500 frequency points. The presented results were obtained on a machine with four six-core Intel Xeon Nehalem processors and 128 GB of physical memory.
Table 3 and Figure 4 show the performance and scalability of the modal frequency response solver. An excellent parallel speed-up of 20.63 on 24 cores reduces the wall-clock analysis time from almost 22 hours to about 1 hour. This drastically shortens turn-around time and enables N&V engineers to analyse several design changes within one business day.
Table 3. Analysis time and scalability of the mode-based frequency response solver

Number of Cores | Wall Clock Time (h:mm) | Parallel Speed-Up
 1 | 21:54 | 1.00
 2 | 11:35 | 1.89
 4 | 5:48 | 3.78
 8 | 2:57 | 7.44
16 | 1:56 | 14.24
24 | 1:04 | 20.63
Figure 4. Analysis time of the mode-based frequency response solver on a shared-memory machine with 24 cores. [Chart: parallel speed-up of 1.89, 3.78, 7.44, 14.24, 20.63 on 2, 4, 8, 16, 24 cores]
Figure 5 demonstrates the parallel efficiency of the modal frequency response solver. The efficiency is defined as the parallel speed-up divided by the number of cores, times 100%; a parallel efficiency of 100% would therefore indicate optimal speed-up. The presented results demonstrate very good efficiency of the modal frequency response solver: about 95% on 2, 4, and 8 cores, just below 90% on 16 cores, and about 86% on 24 cores.

Figure 5. Parallel efficiency of the mode-based frequency response solver on a shared-memory machine with 24 cores
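The efficiencies quoted above can be recomputed directly from Table 3 with a few lines:

```python
# Parallel efficiency = speedup / cores * 100%, from Table 3.
speedup = {1: 1.00, 2: 1.89, 4: 3.78, 8: 7.44, 16: 14.24, 24: 20.63}
for cores, s in speedup.items():
    print(f"{cores:2d} cores: {100.0 * s / cores:.1f}% efficiency")
# -> 100.0, 94.5, 94.5, 93.0, 89.0, 86.0
```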
3: Acceleration of the direct sparse solver using GPGPUs
GPGPUs offer exceptional floating point performance. On recent hardware, double precision floating point operations can theoretically be executed at a rate of 500 GFlops. Of course, in order to approach this peak an algorithm must be embarrassingly parallel, since the tremendous processing speed is largely due to the massive parallelism of the GPGPU hardware.
One of the challenges in exploiting the power of GPGPUs in general purpose FEA codes is that it requires re-writing code in a new language and adapting algorithms to utilize the GPGPU hardware as fully as possible. Currently, there are two GPGPU hardware vendors, each with its own preferred coding language.
In order to maximize the benefit of GPGPU performance while minimizing development effort, we chose to apply this technology to the most floating-point-intensive portion of any implicit FEA program: the linear equation solver. With minimal changes to our existing solver, we created an interface for the factorization of individual supernodes in our direct sparse solver.
We turned to Acceleware Corporation for the implementation of the GPGPU
portion of the project. Their experience with GPGPU acceleration of scientific
algorithms was helpful in getting our first implementation up and running
quickly.
In our current implementation, the GPGPU-accelerated direct solver can greatly reduce the time spent in the solver phase of an FEA analysis for a variety of large models. We have learned that a number of factors must be considered when trying to determine the level of benefit to expect from adding GPGPU compute capability. Abaqus provides an out-of-core solver; however, when enough memory is available, the factorization and subsequent backward pass remain in-core and deliver optimal performance. Once the problem size exceeds the system memory, I/O costs become significant and reduce the overall benefit of GPGPU acceleration. Another factor is the size of the FEA model. The most important measure of size in this case is not the number of degrees of freedom (DoF) in the model, but the number of floating point operations required for factorization. Thus, a 5 million DoF solid element model may be more computationally intensive than a 10 million DoF shell element model.
The target we set for performance gain was an overall 2x speedup of the analysis wall clock time for our benchmark automotive powertrain model, compared to the performance of a 4 core parallel run. The actual results are shown in Figure 6, identified by the number of floating point operations in the solver for this model (1.0E+13). The chart is arranged to show how the amount of work in the solver correlates with the performance improvement obtained when using a GPGPU for compute acceleration. The effectiveness of GPGPU acceleration increases with problem size up to the point where the factorization can no longer fit in core, or an individual supernode does not fit in the GPGPU memory.
Figure 6. Effect of GPGPU acceleration on the performance of 4 core parallel runs. [Chart: GPGPU speedup (4 core time / 4 core + GPU time) vs. solver floating point operation count, ranging from 4.34E+11 to 1.08E+14]
Today, it is common for high performance workstations or compute cluster nodes to have 8 cores. For comparison, the chart in Figure 7 shows 8 core + GPGPU vs. 8 core runs for some of the larger test cases. Here, the addition of GPGPU acceleration is again beneficial, but not to the same degree. Increasing the number of cores increases the number of branches in the supernode tree that are solved concurrently. When more than one branch has a supernode eligible for processing on the GPGPU, there is contention for the GPGPU resource. This results either in a delay (waiting for the GPGPU to become available) or in processing the supernode on the slower CPU resources.
Figure 7. Effect of GPGPU acceleration on the performance of 8 core parallel runs. [Chart: speedup (8 cpu time / 8 cpu + GPU time) vs. solver floating point operation count at 5.8E+12, 1.0E+13, 2.6E+13, and 1.1E+14]
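The contention behaviour described above can be pictured with a small scheduling sketch. This is a hypothetical miniature in Python, not the actual Abaqus scheduler; Supernode, the flop threshold, and the two factorization stubs are all illustrative names.

```python
import threading
from dataclasses import dataclass

@dataclass
class Supernode:
    name: str
    flops: float                 # estimated factorization cost

gpu_lock = threading.Lock()      # a single GPU shared by all tree branches

def factorize_on_gpu(node):      # stand-in for the real GPU kernel
    print(f"{node.name}: GPU")

def factorize_on_cpu(node):      # stand-in for the host BLAS path
    print(f"{node.name}: CPU")

def factorize(node, gpu_threshold=1e9):
    # Offload only large supernodes, and only if the GPU is free right
    # now; otherwise fall back to the slower CPU instead of waiting.
    if node.flops >= gpu_threshold and gpu_lock.acquire(blocking=False):
        try:
            factorize_on_gpu(node)
        finally:
            gpu_lock.release()
    else:
        factorize_on_cpu(node)
```

With more concurrent branches, the probability that the GPU is already busy grows, which is consistent with the smaller gains seen in the 8 core runs of Figure 7.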
Future developments to further leverage GPGPU acceleration of our direct
sparse solver will target deployment on multiple nodes of a compute cluster.
Going forward we hope to find applications for GPGPU compute acceleration
outside of our direct sparse solver.
REFERENCES
1. Bajer, A., “Performance Improvement Algorithm for Mode-Based Frequency
Response Analysis,” SAE Paper No. 2009-01-2223, 2009.