MSC Confidential
MSC Nastran Sparse Direct Solvers for
Tesla GPUs
Cheng Liao
May 16, 2012
Background Information
• MSC Nastran is one of the oldest and most widely used FEA solver codes for aerospace, automotive, and manufacturing in general.
• The first GPU feature, a real symmetric sparse direct solver, was released in version 2012.1 in the second half of 2011.
• Additional GPU features for the complex and unsymmetric sparse direct solvers will appear in a 2012 point release (subject to change).
• The GPU rank-update kernels were developed by NVIDIA engineers for the MSCLDL and MSCLU in-house solvers. (Thank you, NVIDIA!)
• The unsymmetric, 3M, and CAS extensions were done by MSC.
• Work is on-going.
Sparse Direct Factorization
Original matrix $[A]$ → factor matrices $[L][D][L]^t$
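The $[L][D][L]^t$ factorization named above can be illustrated with SciPy's dense `scipy.linalg.ldl` routine. This is a hedged sketch only: MSC's MSCLDL is a sparse multifrontal solver, not this library call, and the test matrix is invented for illustration.

```python
import numpy as np
from scipy.linalg import ldl

# Small symmetric indefinite test matrix (illustrative only).
A = np.array([[4.0, 1.0, 2.0],
              [1.0, -3.0, 0.5],
              [2.0, 0.5, 6.0]])

# Dense LDL^t with pivoting; the returned L already absorbs the permutation.
L, D, perm = ldl(A)
assert np.allclose(L @ D @ L.T, A)
```

The same decomposition applied to a sparse matrix, together with a fill-reducing ordering, is what the multifrontal algorithm on the next slide organizes.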
Multi-frontal Algorithm
In a proper elimination order, do the following at each node:
• Assembly — from element stiffness & contribution blocks, or from global stiffness & contribution blocks
• Pivoting
• Block Gaussian elimination — Factorization, Solve, Update (contribution block)
$$A = \begin{bmatrix} A_{11} & A_{21}^t \\ A_{21} & A_{22} \end{bmatrix}
= \begin{bmatrix} L_{11} & 0 \\ L_{21} & I \end{bmatrix}
\begin{bmatrix} D_{11} & 0 \\ 0 & \tilde{A}_{22} \end{bmatrix}
\begin{bmatrix} L_{11}^t & L_{21}^t \\ 0 & I \end{bmatrix}$$

Factorization: $A_{11} = L_{11} D_{11} L_{11}^t$
Solve: $L_{21} = A_{21} L_{11}^{-t} D_{11}^{-1}$
Update (contribution block): $\tilde{A}_{22} = A_{22} - L_{21} D_{11} L_{21}^t$

[Figure: assembly/elimination tree with nodes numbered 1–11]
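The per-node step above (Factorization, Solve, Update) can be sketched in numpy. A hedged illustration under simplifying assumptions: no pivoting, the assembly step is omitted, and `eliminate_front` is an invented name, not an MSC routine.

```python
import numpy as np

def eliminate_front(F, k):
    """Eliminate the first k (fully-summed) variables of symmetric front F.
    Returns L11, d (diagonal of D11), L21, and the contribution block."""
    A11 = F[:k, :k].copy()
    A21 = F[k:, :k].copy()
    A22 = F[k:, k:].copy()
    # Factorization: A11 = L11 D11 L11^t (unpivoted LDL^t, sketch only)
    L11 = np.eye(k)
    d = np.zeros(k)
    for j in range(k):
        d[j] = A11[j, j]
        L11[j+1:, j] = A11[j+1:, j] / d[j]
        A11[j+1:, j+1:] -= d[j] * np.outer(L11[j+1:, j], L11[j+1:, j])
    # Solve: L21 = A21 L11^{-t} D11^{-1}
    L21 = np.linalg.solve(L11 @ np.diag(d), A21.T).T
    # Update: contribution block passed to the parent node
    CB = A22 - L21 @ np.diag(d) @ L21.T
    return L11, d, L21, CB

# Usage: a small random SPD front, eliminating k = 2 pivots.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
F = M @ M.T + 5 * np.eye(5)
L11, d, L21, CB = eliminate_front(F, 2)
assert np.allclose(L11 @ np.diag(d) @ L11.T, F[:2, :2])
assert np.allclose(L21 @ np.diag(d) @ L11.T, F[2:, :2])
assert np.allclose(CB + L21 @ np.diag(d) @ L21.T, F[2:, 2:])
```

In the multifrontal algorithm the contribution block `CB` is what gets assembled into the parent node's front.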
Three Prong Approach for FP Intensive Kernels
• Sys653=0 (default): segmented rank1 factorization and solve, segmented
s/d/c/zgemm update, thresholding 1x1 and 2x2 pivoting, medium segment size,
least memory usage.
• Sys653=1: segmented rank-1 factorization, segmented s/d/c/ztrsm solve,
segmented s/d/c/zgemm update, no pivoting, larger segment size, slightly more
memory usage.
• Sys653=3: s/d/c/zsytrf and s/d/c/zgetrf factorization, s/d/c/ztrsm solve, segmented
s/d/c/zgemm update, supernodal 1x1 and 2x2 pivoting, largest memory usage.
[Figure: segment/block layouts for sys653=0, sys653=1, sys653=3]
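The main computational difference between these modes — a sequence of rank-1 updates versus one blocked, GEMM-shaped rank-N update — can be sketched in numpy. A hedged illustration with invented data: both forms produce the same Schur complement; the blocked form is what maps onto s/d/c/zgemm.

```python
import numpy as np

rng = np.random.default_rng(1)
k, m = 4, 6
L21 = rng.standard_normal((m, k))
d = rng.standard_normal(k)                  # diagonal of D11
A22 = rng.standard_normal((m, m))
A22 = A22 + A22.T                           # symmetric trailing block

# Rank-1 at a time: A22 -= d_j * l_j l_j^t for each eliminated column j.
S1 = A22.copy()
for j in range(k):
    S1 -= d[j] * np.outer(L21[:, j], L21[:, j])

# Blocked rank-N: one GEMM-shaped update A22 - (L21 D11) L21^t.
S2 = A22 - (L21 * d) @ L21.T

assert np.allclose(S1, S2)
```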
Superelement Matrix Domain Solver
Superelement stiffness: $K^j = \begin{bmatrix} K^j_{ii} & K^j_{ib} \\ K^j_{bi} & K^j_{bb} \end{bmatrix}$

Superelement boundary stiffness: $\bar{K}^j_{bb} = K^j_{bb} - K^j_{bi} \left(K^j_{ii}\right)^{-1} K^j_{ib}$

Global stiffness assembly: $K_{bb} = \sum_{j=1}^{n} \bar{K}^j_{bb}$

Global equilibrium: $K_{bb} X_b = F_b$

[Figure: matrix domain partitioned into superelements S1–S4]
Segmented (rankN) Update with Staged DGEMM
Update: $\tilde{A}_{22} = A_{22} - L_{21} D_{11} L_{21}^t$, with $L_{21} D_{11}$ formed by a copy & scale step.

[Figure: the front (nfront columns; column panel icol spans ifrtf–ifrtl) is split into rank-size segments S1/S2; segments cycle through "Compute on GPU", "Copyback GPU→CPU", and "OMP update on CPU", so GPU DGEMM overlaps with OpenMP updates on the CPU.]
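The segmented update can be sketched in numpy. A hedged illustration: the trailing matrix is split into column segments, each updated by its own GEMM-shaped product; in the real solver alternate segments run on the GPU while the CPU (OpenMP) updates others, but here the segments are simply processed in a loop. `segmented_update` and its data are invented for illustration.

```python
import numpy as np

def segmented_update(A22, L21, d, seg):
    """A22 -= (L21 D11) L21^t, one column segment of width seg at a time."""
    W = L21 * d                              # copy & scale: L21 D11
    m = A22.shape[1]
    for s in range(0, m, seg):
        cols = slice(s, min(s + seg, m))
        A22[:, cols] -= W @ L21[cols, :].T   # per-segment DGEMM
    return A22

rng = np.random.default_rng(3)
m, k = 10, 4                                 # front size m, rank size k
L21 = rng.standard_normal((m, k))
d = rng.standard_normal(k)
A22 = rng.standard_normal((m, m))

ref = A22 - (L21 * d) @ L21.T                # unsegmented reference update
out = segmented_update(A22.copy(), L21, d, seg=3)
assert np.allclose(out, ref)
```

Because each segment's update is independent, segments can be staged across the GPU and CPU, which is the overlap the charts on the next slides measure.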
C2070/Westmere-EP Update Gflops vs Rank Size
[Chart: update Gflops (0–450) vs rank size (32–512) at nfront=8000; curves: cs, cs_3m_cut, cu, cs_cas_3m, rs, ru, rs_cas]
C2070/Westmere-EP Update Gflops vs Front Size
[Chart: update Gflops (0–400) vs front size (1k–11k) at nrank=384; curves: cs, cs_3m_cut, cu, cs_cas_3m, rs, ru, rs_cas]
DMP+SMP+GPU (943K dof Crankshaft SOL101)
[Chart: solver-only elapsed time (lower is better) for 1G6C, 1C, 2G6C, and 6C runs at sys653=0, 1, 3; Westmere 3.33 GHz 2x6 cores, Tesla C2070/2075, 48 GB memory]
Large 2.4M dof Linear Static SOL101
[Chart: elapsed time (lower is better) for 1G4C vs 4C runs at sys653=0, 1, 3; Westmere 2.93 GHz 2x6 cores, Tesla 2x M2070, 96 GB memory]
SOL108, NVH
Exterior Acoustics simulation
• >5X speedup with 1 GPU over serial run
• ~2x speedup with 2 GPUs over 8 cores
EMEA customer model: structural & acoustic mesh with infinite elements; DOF: 0.165M
[Chart: speedup over serial run (0–8) for serial, 4c, 4c1g, 8c (dmp=2), 8c2g (dmp=2); 3 FREQ1 increments]
Server node: Westmere 2.93 GHz, 2x 6 core; Tesla 2x M2070 GPU, 96 GB memory
SOL108, NVH
Trimmed Car Body Frequency Response
USA automotive OEM model: Grids: 1.2M; Shells (CQUAD4): 1.04M; Solids (CTETRA): 0.1M; DOF: 7.47M
[Chart: elapsed time in minutes (0–250) for serial, 4c, 4c+1g, 8c (dmp=2 on 2 nodes), 8c+2g (dmp=2 on 2 nodes); speedups 1x, 2x, 3.7x, 3.7x, 5.6x; 3 FREQ1 increments, 5 load cases]
Server node: Westmere 2.93 GHz, 2x 6 core; Tesla 2x M2070 GPU, 96 GB memory
• ~4X speedup with 1 GPU over serial run
• 1.55x speedup with 2 GPUs over 8 cores
SOL108, NVH
Coupled Structural-Acoustics simulation
EMEA automotive OEM model: Grids: 0.71M; CTETRAs: 3.8M; DOF: 0.71M
[Chart: elapsed time in minutes (0–80) for serial, 4c (smp), 4c+1g, 8c (smp), 8c (dmp=2), 8c+2g (dmp=2); speedups 1x, 2x, 4x, 2.5x, 3.6x, 5.5x; FREQ1 limited to 5 frequency increments]
Server node: Westmere 2.93 GHz, 2x 6 core; Tesla 2x M2070 GPU, 96 GB memory
• 4X speedup with 1 GPU over serial run
• ~1.55x speedup with 2 GPUs over 8 cores
Conclusions
• GPGPU can give a very significant performance boost for large, direct-solver-intensive jobs, i.e., max front size > 10,000 for real data and > 5,000 for complex data models.
• Multiple-GPGPU performance is delivered by SOL108 (embarrassingly parallel) and by the matrix domain solver in SOL101.
• NVIDIA and MSC continue to work together to tune BLAS and LAPACK kernels for MSCLDL and MSCLU.
• Tree-level parallelism, using GPU memory to cache large sub-tree computations, and similar techniques provide a road map to even greater future performance.
• A number of other Nastran functional areas are candidates for GPU acceleration.
Acknowledgement
NVIDIA: Steve Harpster, Srinivas Kodiyalam, Stan Posey, Thomas Reed,
Steve Rennich, Peng Wang, Shucai Xiao.
MSC: Sunny Gogar, Joe Griffin, Dan’l Pierce, Peter Schartz, Alex TenEyck,
Paul Vanderwalt, Ted Wertheimer, Gaowen Ye, members of MSC
India QA team.