ACCELERATING COMMERCIAL LINEAR DYNAMIC AND
NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH-
PERFORMANCE COMPUTING
Vladimir Belsky, Director of Solver Development*
Luis Crivelli, Director of Solver Development*
Matt Dunbar, Chief Architect*
Mikhail Belyi, Development Group Manager*
Michael Wood, Developer*
Cristian Ianculescu, Developer*
Mintae Kim, Developer*
Andrzej Bajer, Developer*
*Dassault Systèmes Simulia Corp.
Geraud Krawezik, Developer, Acceleware, Canada
ABSTRACT
In the last decade, significant R&D resources have been invested to deliver commercially available technologies that meet current and future mechanical engineering industry requirements, both in terms of mechanics and performance. While significant focus has been given to developing robust nonlinear finite element analysis technology, there has also been continued investment in advancing linear dynamic analyses. These research and development efforts have focused on combining advanced linear and nonlinear technology to provide accurate yet fast modelling of noise and vibration engineering problems. This effort has enabled high-fidelity models to run in reasonable time, which is vital for virtual prototyping within shortened product design cycles.
While it is true that model sizes (degrees of freedom) have grown significantly during this period, the complexity of the models has also increased, leading to a larger number of total iterations within nonlinear implicit analyses and to a large number of eigenmodes within linear dynamic simulations. An innovative approach has been developed to leverage high-performance computing (HPC) resources to yield reasonable turn-around times for such analyses by taking advantage of massive parallelism without sacrificing any mechanical formulation quality.
The accessibility and affordability of HPC hardware in the past few years has changed the landscape of commercial finite element analysis software usage and applications. This change has come in response to an expressed desire from engineers and designers to run their existing simulations faster or, in many cases, to run more realistic jobs. Due to their computational cost and the lack of high-performance commercial software, such "high-end" simulations were until recently thought to be available only to academic institutions or government research laboratories, which typically developed their own HPC applications. Today, with the advent of affordable multi-core SMP workstations and compute clusters with multi-core nodes, high-speed interconnects, and GPGPU accelerators, HPC is sought after by many engineers for routine FEA. This presents a challenge for commercial FEA software vendors, which must adapt their decades-old legacy code to take advantage of state-of-the-art HPC platforms.
Given this background, this paper focuses on how recent developments in HPC have affected the performance of linear dynamic and implicit nonlinear analyses. Two main HPC developments are studied. First, we examine the performance and scalability of the commercially available Abaqus AMS eigenvalue solver, and of the entire frequency response simulation, running on multi-core SMP workstations. Advances in the AMS eigenvalue solution procedure and linear dynamic capabilities make realistic simulation practical for a wide range of vehicle-level noise and vibration simulations.
Next, we discuss the progress made in a relatively new but very active area of high-performance commercial FE software development: taking advantage of GPGPU accelerators. Efficient adoption of GPGPUs in such products is a very challenging task that requires significant re-architecture of the existing code. We describe our experience integrating GPGPU acceleration into complex commercial engineering software; in particular, we discuss the trade-offs we made and the benefits we obtained from this technology.
KEYWORDS
HPC, Parallel Computing, Cluster Computing, Equation Solver, Non-linear Implicit FEA, GPGPU, Modal Linear Dynamics, AMS, Automated Multilevel Substructuring, Abaqus
1: AMS (Automatic Multilevel Substructuring) Eigensolver
As model meshes become more refined and accurate, the complexity of the models increases and the size of finite element models grows, all while the demand for faster job turn-around time continues to be strong. The role of a mode-based approach in linear dynamic analyses becomes crucial, given that the direct approach, based on solving a system of equations on the physical domain at each excitation frequency, becomes far more expensive as the size of finite element models grows. The most time-consuming task in mode-based linear dynamic analyses is the solution of a large eigenvalue extraction problem to create the modal basis. The most advanced eigenvalue extraction technology suitable for today's needs in automotive noise and vibration (N&V) simulation is AMLS. Beginning in 2006, SIMULIA began to offer a version of AMLS, marketed as Abaqus/AMS. The performance of the AMS eigensolver is therefore crucially important for reducing overall analysis runtime in large-scale N&V simulations.
Over the past three years (2007-2010), the Abaqus AMS eigensolver has evolved from its original serial implementation, designed for single-processor computers with limited memory and able to solve problems with a couple of million equations, into a modern implementation designed for multi-core processors and large memory, capable of solving problems with tens of millions of equations. Beginning with the Abaqus 6.10 Enhanced Functionality release, the AMS eigensolver can run in parallel on shared-memory computers with multiple processors, and its parallel performance has been improved substantially since that release.
To demonstrate the AMS eigensolver performance on HPC hardware, two automotive industrial models were run on a machine with four six-core Intel Xeon Nehalem processors and 128 GB of physical memory.
The first model, referred to as 'Model 1', is an automotive vehicle body model with 14.1 million degrees of freedom. This model has an acoustic cavity for coupled structural-acoustic frequency response analysis; the modal basis consists of 5190 structural modes and 266 acoustic modes below the maximum frequency of 600 Hz. The selective recovery capability for the structural domain, which recovers user-requested output variables at a user-defined node set, and the full recovery capability for the acoustic domain, which recovers user-requested output variables at all nodes of the model, are used in this simulation.
The second model, 'Model 2', is a powertrain model with 11.2 million degrees of freedom. The modal basis includes 377 modes below 2500 Hz, and the selective recovery capability is used.
The pre-release version of Abaqus 6.11 was used to obtain the performance data for both models.
Table 1 demonstrates the parallel performance of the AMS eigensolver for Model 1. In the table, 'FREQ' denotes the whole frequency extraction procedure, which includes the AMS eigensolver together with the non-scalable, non-solver parts of the code, while 'AMS' denotes the AMS eigensolver itself. The AMS eigensolver takes only 25 minutes to solve the eigenproblem on 16 cores, versus about 4 hours on a single core. The non-scalable parts become dominant as the number of cores increases. Figure 1 shows the scalability of the AMS eigensolver based on the data in Table 1. Thanks to the good parallel speedup of AMS, the frequency extraction procedure FREQ shows an overall speedup of about 5.
Table 1. Performance of the AMS eigensolver (AMS) and frequency extraction procedure (FREQ) for Model 1

Number of Cores | FREQ (6.11) Wall Clock Time (h:mm) | AMS (6.11) Wall Clock Time (h:mm)
 1 | 4:32 | 4:01
 4 | 1:38 | 1:07
 8 | 1:09 | 0:39
16 | 0:56 | 0:25
Figure 1. Scalability of the Abaqus 6.11 AMS eigensolver (AMS) and frequency extraction procedure (FREQ) for Model 1. [Chart: speedup factors on 4/8/16 cores; FREQ: 2.77, 3.93, 4.89; AMS: 3.58, 6.21, 9.61]
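The speedup factors plotted in Figure 1 follow directly from the wall-clock times in Table 1. As a quick sanity check, the short Python sketch below recomputes them; small differences from the plotted values arise because the table times are rounded to whole minutes.

```python
# Recompute the Figure 1 speedup factors from the Table 1 times.
def minutes(hmm):
    h, m = hmm.split(":")
    return int(h) * 60 + int(m)

freq = {1: "4:32", 4: "1:38", 8: "1:09", 16: "0:56"}  # FREQ (6.11)
ams  = {1: "4:01", 4: "1:07", 8: "0:39", 16: "0:25"}  # AMS (6.11)

for cores in (4, 8, 16):
    print(f"{cores:2d} cores: "
          f"FREQ x{minutes(freq[1]) / minutes(freq[cores]):.2f}, "
          f"AMS x{minutes(ams[1]) / minutes(ams[cores]):.2f}")
```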
Table 2 and Figure 2 show the parallel performance and scalability of the frequency extraction procedure (FREQ) and the AMS eigensolver (AMS) for Model 2. Due to the good scalability of the AMS eigensolver, the frequency extraction procedure takes only 36 minutes for this large model, which significantly reduces the overall job turn-around time.
Table 2. Performance of the AMS eigensolver (AMS) and frequency extraction procedure (FREQ) for Model 2

Number of Cores | FREQ (6.11) Wall Clock Time (h:mm) | AMS (6.11) Wall Clock Time (h:mm)
 1 | 2:57 | 2:33
 4 | 1:03 | 0:39
 8 | 0:45 | 0:21
16 | 0:36 | 0:13
Figure 2. Scalability of the Abaqus 6.11 AMS eigensolver (AMS) and frequency extraction procedure (FREQ) for Model 2. [Chart: speedup factors on 4/8/16 cores; FREQ: 2.79, 3.96, 4.86; AMS: 3.87, 7.45, 11.70]
2: Mode-based Frequency Response Analysis
Mode-based frequency response analysis is the commonly accepted method used by N&V engineers for the simulation of noise and vibration in vehicles and other structures. To reduce the cost of the analysis, the system of equations is solved in a modal subspace. The projection of the finite element system onto the modal subspace requires an eigenvalue extraction analysis, which in Abaqus is typically performed using the AMS eigensolver described in the previous section. The projected system of equations in the modal subspace takes the following form:

\begin{bmatrix} K - \omega^2 M & -(\omega C + D) \\ -(\omega C + D) & -(K - \omega^2 M) \end{bmatrix} \begin{Bmatrix} \operatorname{Re}(Q(\omega)) \\ \operatorname{Im}(Q(\omega)) \end{Bmatrix} = \begin{Bmatrix} \operatorname{Re}(F(\omega)) \\ -\operatorname{Im}(F(\omega)) \end{Bmatrix} \qquad (1)

Here K is the system stiffness matrix; M the mass matrix; C the viscous damping matrix; D the structural damping matrix; ω the excitation frequency; Q the generalized displacement; F the force vector; Re() the real part of a complex quantity; and Im() the imaginary part of a complex quantity.
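For readers who want the intervening step, equation (1) is what results from splitting the complex-valued modal system into real and imaginary parts; the brief derivation below is our reconstruction, consistent with the nomenclature above.

\begin{align*}
&\left(K + iD - \omega^2 M + i\omega C\right) Q(\omega) = F(\omega), \qquad Q = Q_r + iQ_i,\quad F = F_r + iF_i \\
&\text{Real part:} \quad (K - \omega^2 M)\, Q_r - (\omega C + D)\, Q_i = F_r \\
&\text{Imaginary part:} \quad (\omega C + D)\, Q_r + (K - \omega^2 M)\, Q_i = F_i
\end{align*}

Negating the imaginary-part equation makes the 2x2 block operator symmetric, which yields exactly the form (1).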
The size of the modal system (1) is twice the number of modes. If the frequency response is performed in the mid-frequency range, there are often more than 10,000 modes in a complex structure. If only diagonal damping is applied, the mode-based analysis is quite inexpensive, because the system of equations (1) becomes decoupled and every equation is solved separately. However, in the mid-frequency range modal damping alone is not sufficient, and material damping (e.g., dashpot elements and material structural damping) must be applied to obtain accurate results. The material damping causes the projected damping operators C and/or D in equation (1) to be fully populated. Thus, a system of linear equations whose size is twice the number of modes (2N) must be solved in the modal subspace at every frequency point. With a few hundred to a thousand frequency points, and the number of modes over 10,000, this becomes a rather expensive analysis.

Figure 3. The structure of the left-hand side operator for the mode-based frequency response analysis

In a typical case, when the stiffness matrix is symmetric and constant with respect to excitation frequency, the stiffness and mass operators are reduced to diagonal matrices in the modal subspace. The structure of the system of modal equations (1) in this case is presented in Figure 3. The diagonal blocks are diagonal matrices (corresponding to a linear combination of the projected mass and stiffness operators), while the off-diagonal blocks are fully populated (corresponding to the projected structural and viscous damping operators). Traditionally, this system of equations of size 2N is solved at every frequency. We instead first take advantage of the diagonal structure of part of the operator and reduce the size of the system by half; this reduction leaves a fully populated system of equations of size N. For the details and derivation of the reduction algorithm we refer to [1]. The reduction phase is dominated by matrix-matrix multiplication operations and takes more time than the subsequent solution of the reduced system. Thus, to obtain an efficient parallel algorithm, we need to parallelize both operations: the matrix-matrix multiplication and the factorization of the dense system of equations.
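To make the reduction concrete, the sketch below shows one way such an elimination can be written; it is an illustrative reconstruction based on the structure of equation (1), not the actual algorithm of [1]. In the modal subspace A = K - ω²M is diagonal (stored as a vector a) and B = ωC + D is dense; eliminating Re(Q) from the first block row leaves a dense N x N system for Im(Q), and the product B A⁻¹ B is the dominant O(N³) matrix-matrix multiplication cost.

```python
import numpy as np

def solve_reduced(a, B, f_re, f_im):
    """Solve eq. (1) at one frequency via reduction to size N.

    a    : (N,)  diagonal of A = K - omega^2*M (nonzero, i.e. omega
                 is not exactly a natural frequency)
    B    : (N,N) dense B = omega*C + D
    f_re, f_im : (N,) real and imaginary parts of F(omega)
    """
    Ainv_B = B / a[:, None]          # A^{-1} B: cheap row scaling
    S = np.diag(a) + B @ Ainv_B      # dense reduction, dominant O(N^3) cost
    rhs = f_im - B @ (f_re / a)
    q_im = np.linalg.solve(S, rhs)   # factorization of the dense N x N system
    q_re = (f_re + B @ q_im) / a     # recover Re(Q) from the first block row
    return q_re, q_im
```

In a full frequency sweep this reduction and dense solve are repeated at each of the several hundred frequency points, which is why both the multiplication and the factorization must be parallelized.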
The parallel algorithm for mode-based frequency response analysis is implemented on shared-memory machines. The computationally expensive ingredients of this algorithm, matrix-matrix products and dense linear solves, have been parallelized using a task-based approach, as sketched below. This implementation ensures that memory consumption remains constant regardless of the number of processors used, while achieving almost linear parallel scaling up to the number of general-purpose computational cores available on modern hardware.
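As an illustration of the task-based approach, the miniature below partitions a matrix-matrix product into independent column-block tasks executed by a thread pool. This is a hypothetical sketch (the production solver is not written in Python), but it shows the key property claimed above: the output buffer is preallocated once, so memory use does not grow with the number of threads.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def task_parallel_matmul(A, B, n_tasks=8):
    """Compute A @ B as independent column-block tasks."""
    out = np.empty((A.shape[0], B.shape[1]))   # single shared result buffer
    blocks = np.array_split(np.arange(B.shape[1]), n_tasks)

    def task(cols):
        # Each task writes a disjoint column block; NumPy releases the
        # GIL inside the underlying BLAS call, so tasks run concurrently.
        out[:, cols] = A @ B[:, cols]

    with ThreadPoolExecutor() as pool:
        list(pool.map(task, blocks))
    return out
```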
To demonstrate the effectiveness of this algorithm, we present a typical N&V analysis of the structural vibration of a car body. The stiffness matrix is symmetric and the model includes some structural damping, so the projected system looks like the one illustrated in Figure 3. Over 10,000 modes were extracted using the Abaqus/AMS eigensolver, and the analysis was performed at 500 frequency points. The presented results were obtained on a machine with four six-core Intel Xeon Nehalem processors and 128 GB of physical memory.
Table 3 and Figure 4 show the performance and scalability of the modal frequency response solver. An excellent parallel speed-up of 20.63 on 24 cores reduces the wall-clock analysis time from almost 22 hours to about 1 hour. This drastically shortens turn-around time and enables N&V engineers to analyse several design changes within one business day.
Table 3. Analysis time and scalability of the mode-based frequency response solver

Number of Cores | Wall Clock Time (h:mm) | Parallel Speed-Up
 1 | 21:54 | 1.00
 2 | 11:35 | 1.89
 4 | 5:48 | 3.78
 8 | 2:57 | 7.44
16 | 1:56 | 14.24
24 | 1:04 | 20.63
Figure 4. Analysis time of the mode-based frequency response solver on a shared-memory machine with 24 cores. [Chart: parallel speed-up of 1.89, 3.78, 7.44, 14.24, 20.63 on 2, 4, 8, 16, 24 cores]
Figure 5 demonstrates the parallel efficiency of the modal frequency response solver. The efficiency is defined as the parallel speed-up divided by the number of cores, times 100%; a parallel efficiency of 100% would therefore indicate optimal speed-up. The presented results demonstrate very good efficiency of the modal frequency response solver: about 95% on 2, 4, and 8 cores, just below 90% on 16 cores, and about 86% on 24 cores.

Figure 5. Parallel efficiency of the mode-based frequency response solver on a shared-memory machine with 24 cores
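The efficiencies quoted above can be recomputed directly from Table 3 with a few lines:

```python
# Parallel efficiency = speedup / cores * 100%, from Table 3.
speedup = {1: 1.00, 2: 1.89, 4: 3.78, 8: 7.44, 16: 14.24, 24: 20.63}
for cores, s in speedup.items():
    print(f"{cores:2d} cores: {100.0 * s / cores:.1f}% efficiency")
# -> 100.0, 94.5, 94.5, 93.0, 89.0, 86.0
```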
3: Acceleration of the direct sparse solver using GPGPUs
GPGPUs offer exceptional floating point performance. On recent hardware, double precision floating point operations can theoretically be executed at a rate of 500 GFlops. Of course, in order to approach this peak an algorithm must be embarrassingly parallel, since the tremendous processing speed is largely due to the massive parallelism of the GPGPU hardware.
One of the challenges in exploiting the power of GPGPUs in general purpose FEA codes is that it requires re-writing code in a new language and adapting algorithms to utilize the GPGPU hardware as fully as possible. Currently, there are two GPGPU hardware vendors, each with its own preferred coding language.
In order to maximize the benefit of GPGPU performance while minimizing development effort, we chose to apply this technology to the most floating-point-intensive portion of any implicit FEA program: the linear equation solver. With minimal changes to our existing solver, we created an interface for the factorization of individual supernodes in our direct sparse solver.
We turned to Acceleware Corporation for the implementation of the GPGPU
portion of the project. Their experience with GPGPU acceleration of scientific
algorithms was helpful in getting our first implementation up and running
quickly.
In our current implementation, the GPGPU-accelerated direct solver can greatly reduce the time spent in the solver phase of an FEA analysis for a variety of large models. We have learned that a number of factors must be considered when trying to determine the level of benefit to expect from adding GPGPU compute capability. Abaqus provides an out-of-core solver; however, when enough memory is available, the factorization and subsequent backward pass remain in-core and deliver optimal performance. Once the problem size exceeds the system memory, I/O costs become significant and reduce the overall benefit of GPGPU acceleration. Another factor is the size of the FEA model. The most important measure of size in this case is not the number of degrees of freedom (DoF) in the model, but the number of floating point operations required for factorization. Thus, a 5 million DoF solid element model may be more computationally intensive than a 10 million DoF shell element model.
The target we set for performance gain was an overall 2x speedup of the analysis wall clock time for our benchmark automotive powertrain model, compared to the performance of a 4 core parallel run. The actual results are shown in Figure 6, identified by the number of floating point operations in the solver for this model (1.0E+13). The chart is arranged to show how the amount of work in the solver correlates with the performance improvement obtained when using a GPGPU for compute acceleration. The effectiveness of GPGPU acceleration increases with problem size up to the point where the factorization can no longer fit in core, or an individual supernode does not fit in the GPGPU memory.
Figure 6. Effect of GPGPU acceleration on the performance of 4 core parallel runs. [Chart: GPGPU speedup (4 core time / 4 core + GPU time) vs. solver floating point operation count, ranging from 4.34E+11 to 1.08E+14]
Today, it is common for high performance workstations or compute cluster nodes to have 8 cores. For comparison, the chart in Figure 7 shows 8 core + GPGPU vs. 8 core runs for some of the larger test cases. Here, the addition of GPGPU acceleration is again beneficial, but not to the same degree. Increasing the number of cores increases the number of branches in the supernode tree that are solved concurrently. When more than one branch has a supernode eligible for processing on the GPGPU, there is contention for the GPGPU resource. This results either in a delay (waiting for the GPGPU to become available) or in processing the supernode on the slower CPU resources.
Figure 7. Effect of GPGPU acceleration on the performance of 8 core parallel runs. [Chart: speedup (8 cpu time / 8 cpu + GPU time) vs. solver floating point operation count at 5.8E+12, 1.0E+13, 2.6E+13, and 1.1E+14]
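The contention behaviour described above can be pictured with a small scheduling sketch. This is a hypothetical miniature in Python, not the actual Abaqus scheduler; Supernode, the flop threshold, and the two factorization stubs are all illustrative names.

```python
import threading
from dataclasses import dataclass

@dataclass
class Supernode:
    name: str
    flops: float                 # estimated factorization cost

gpu_lock = threading.Lock()      # a single GPU shared by all tree branches

def factorize_on_gpu(node):      # stand-in for the real GPU kernel
    print(f"{node.name}: GPU")

def factorize_on_cpu(node):      # stand-in for the host BLAS path
    print(f"{node.name}: CPU")

def factorize(node, gpu_threshold=1e9):
    # Offload only large supernodes, and only if the GPU is free right
    # now; otherwise fall back to the slower CPU instead of waiting.
    if node.flops >= gpu_threshold and gpu_lock.acquire(blocking=False):
        try:
            factorize_on_gpu(node)
        finally:
            gpu_lock.release()
    else:
        factorize_on_cpu(node)
```

With more concurrent branches, the probability that the GPU is already busy grows, which is consistent with the smaller gains seen in the 8 core runs of Figure 7.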
Future developments to further leverage GPGPU acceleration of our direct
sparse solver will target deployment on multiple nodes of a compute cluster.
Going forward we hope to find applications for GPGPU compute acceleration
outside of our direct sparse solver.
REFERENCES
1. Bajer, A., “Performance Improvement Algorithm for Mode-Based Frequency
Response Analysis,” SAE Paper No. 2009-01-2223, 2009.