MSC Confidential
MSC Nastran Sparse Direct Solvers for
Tesla GPUs
Cheng Liao
May 16, 2012
Background Information
• MSC Nastran is one of the oldest and most widely used FEA solver codes for aerospace, automotive, and manufacturing in general.
• The first GPU feature, a real symmetric sparse direct solver, was released in version 2012.1 in the second half of 2011.
• Additional GPU features for the complex and unsymmetric sparse direct solvers will appear in a 2012 point release (subject to change).
• The GPU rank-update kernels were developed by NVIDIA engineers for the MSCLDL and MSCLU in-house solvers. (Thank you, NVIDIA!)
• The unsymmetric, 3M, and CAS extensions were done by MSC.
• Work is on-going.
Sparse Direct Factorization
Original matrix $[A]$ → factor matrices $[L][D][L]^t$
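The $[L][D][L]^t$ factorization named above can be illustrated with SciPy's dense `scipy.linalg.ldl` routine. This is a hedged sketch only: MSC's MSCLDL is a sparse multifrontal solver, not this library call, and the test matrix is invented for illustration.

```python
import numpy as np
from scipy.linalg import ldl

# Small symmetric indefinite test matrix (illustrative only).
A = np.array([[4.0, 1.0, 2.0],
              [1.0, -3.0, 0.5],
              [2.0, 0.5, 6.0]])

# Dense LDL^t with pivoting; the returned L already absorbs the permutation.
L, D, perm = ldl(A)
assert np.allclose(L @ D @ L.T, A)
```

The same decomposition applied to a sparse matrix, together with a fill-reducing ordering, is what the multifrontal algorithm on the next slide organizes.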
Multi-frontal Algorithm
In a proper elimination order, do the following at each node:
• Assembly — from element stiffness & contribution blocks, or from global stiffness & contribution blocks
• Pivoting
• Block Gaussian elimination — Factorization, Solve, Update (contribution block)
$$A = \begin{bmatrix} A_{11} & A_{21}^t \\ A_{21} & A_{22} \end{bmatrix}
= \begin{bmatrix} L_{11} & 0 \\ L_{21} & I \end{bmatrix}
\begin{bmatrix} D_{11} & 0 \\ 0 & \tilde{A}_{22} \end{bmatrix}
\begin{bmatrix} L_{11}^t & L_{21}^t \\ 0 & I \end{bmatrix}$$

Factorization: $A_{11} = L_{11} D_{11} L_{11}^t$
Solve: $L_{21} = A_{21} L_{11}^{-t} D_{11}^{-1}$
Update (contribution block): $\tilde{A}_{22} = A_{22} - L_{21} D_{11} L_{21}^t$

[Figure: assembly/elimination tree with nodes numbered 1–11]
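The per-node step above (Factorization, Solve, Update) can be sketched in numpy. A hedged illustration under simplifying assumptions: no pivoting, the assembly step is omitted, and `eliminate_front` is an invented name, not an MSC routine.

```python
import numpy as np

def eliminate_front(F, k):
    """Eliminate the first k (fully-summed) variables of symmetric front F.
    Returns L11, d (diagonal of D11), L21, and the contribution block."""
    A11 = F[:k, :k].copy()
    A21 = F[k:, :k].copy()
    A22 = F[k:, k:].copy()
    # Factorization: A11 = L11 D11 L11^t (unpivoted LDL^t, sketch only)
    L11 = np.eye(k)
    d = np.zeros(k)
    for j in range(k):
        d[j] = A11[j, j]
        L11[j+1:, j] = A11[j+1:, j] / d[j]
        A11[j+1:, j+1:] -= d[j] * np.outer(L11[j+1:, j], L11[j+1:, j])
    # Solve: L21 = A21 L11^{-t} D11^{-1}
    L21 = np.linalg.solve(L11 @ np.diag(d), A21.T).T
    # Update: contribution block passed to the parent node
    CB = A22 - L21 @ np.diag(d) @ L21.T
    return L11, d, L21, CB

# Usage: a small random SPD front, eliminating k = 2 pivots.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
F = M @ M.T + 5 * np.eye(5)
L11, d, L21, CB = eliminate_front(F, 2)
assert np.allclose(L11 @ np.diag(d) @ L11.T, F[:2, :2])
assert np.allclose(L21 @ np.diag(d) @ L11.T, F[2:, :2])
assert np.allclose(CB + L21 @ np.diag(d) @ L21.T, F[2:, 2:])
```

In the multifrontal algorithm the contribution block `CB` is what gets assembled into the parent node's front.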
Three Prong Approach for FP Intensive Kernels
• Sys653=0 (default): segmented rank1 factorization and solve, segmented
s/d/c/zgemm update, thresholding 1x1 and 2x2 pivoting, medium segment size,
least memory usage.
• Sys653=1: segmented rank-1 factorization, segmented s/d/c/ztrsm solve,
segmented s/d/c/zgemm update, no pivoting, larger segment size, slightly more
memory usage.
• Sys653=3: s/d/c/zsytrf and s/d/c/zgetrf factorization, s/d/c/ztrsm solve, segmented
s/d/c/zgemm update, supernodal 1x1 and 2x2 pivoting, largest memory usage.
[Figure: segment/block layouts for sys653=0, sys653=1, sys653=3]
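The main computational difference between these modes — a sequence of rank-1 updates versus one blocked, GEMM-shaped rank-N update — can be sketched in numpy. A hedged illustration with invented data: both forms produce the same Schur complement; the blocked form is what maps onto s/d/c/zgemm.

```python
import numpy as np

rng = np.random.default_rng(1)
k, m = 4, 6
L21 = rng.standard_normal((m, k))
d = rng.standard_normal(k)                  # diagonal of D11
A22 = rng.standard_normal((m, m))
A22 = A22 + A22.T                           # symmetric trailing block

# Rank-1 at a time: A22 -= d_j * l_j l_j^t for each eliminated column j.
S1 = A22.copy()
for j in range(k):
    S1 -= d[j] * np.outer(L21[:, j], L21[:, j])

# Blocked rank-N: one GEMM-shaped update A22 - (L21 D11) L21^t.
S2 = A22 - (L21 * d) @ L21.T

assert np.allclose(S1, S2)
```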
Superelement Matrix Domain Solver
Superelement stiffness: $K^j = \begin{bmatrix} K^j_{ii} & K^j_{ib} \\ K^j_{bi} & K^j_{bb} \end{bmatrix}$

Superelement boundary stiffness: $\bar{K}^j_{bb} = K^j_{bb} - K^j_{bi} \left(K^j_{ii}\right)^{-1} K^j_{ib}$

Global stiffness assembly: $K_{bb} = \sum_{j=1}^{n} \bar{K}^j_{bb}$

Global equilibrium: $K_{bb} X_b = F_b$

[Figure: matrix domain partitioned into superelements S1–S4]
Segmented (rankN) Update with Staged DGEMM
Update: $\tilde{A}_{22} = A_{22} - L_{21} D_{11} L_{21}^t$, with $L_{21} D_{11}$ formed by a copy & scale step.

[Figure: the front (nfront columns; column panel icol spans ifrtf–ifrtl) is split into rank-size segments S1/S2; segments cycle through "Compute on GPU", "Copyback GPU→CPU", and "OMP update on CPU", so GPU DGEMM overlaps with OpenMP updates on the CPU.]
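The segmented update can be sketched in numpy. A hedged illustration: the trailing matrix is split into column segments, each updated by its own GEMM-shaped product; in the real solver alternate segments run on the GPU while the CPU (OpenMP) updates others, but here the segments are simply processed in a loop. `segmented_update` and its data are invented for illustration.

```python
import numpy as np

def segmented_update(A22, L21, d, seg):
    """A22 -= (L21 D11) L21^t, one column segment of width seg at a time."""
    W = L21 * d                              # copy & scale: L21 D11
    m = A22.shape[1]
    for s in range(0, m, seg):
        cols = slice(s, min(s + seg, m))
        A22[:, cols] -= W @ L21[cols, :].T   # per-segment DGEMM
    return A22

rng = np.random.default_rng(3)
m, k = 10, 4                                 # front size m, rank size k
L21 = rng.standard_normal((m, k))
d = rng.standard_normal(k)
A22 = rng.standard_normal((m, m))

ref = A22 - (L21 * d) @ L21.T                # unsegmented reference update
out = segmented_update(A22.copy(), L21, d, seg=3)
assert np.allclose(out, ref)
```

Because each segment's update is independent, segments can be staged across the GPU and CPU, which is the overlap the charts on the next slides measure.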
C2070/Westmere-EP Update Gflops vs Rank Size
[Chart: update Gflops (0–450) vs rank size (32–512) at nfront=8000; curves: cs, cs_3m_cut, cu, cs_cas_3m, rs, ru, rs_cas]
C2070/Westmere-EP Update Gflops vs Front Size
[Chart: update Gflops (0–400) vs front size (1k–11k) at nrank=384; curves: cs, cs_3m_cut, cu, cs_cas_3m, rs, ru, rs_cas]
DMP+SMP+GPU (943K dof Crankshaft SOL101)
[Chart: solver-only elapsed time (lower is better) for 1G6C, 1C, 2G6C, and 6C runs at sys653=0, 1, 3; Westmere 3.33 GHz 2x6 cores, Tesla C2070/2075, 48 GB memory]
Large 2.4M dof Linear Static SOL101
[Chart: elapsed time (lower is better) for 1G4C vs 4C runs at sys653=0, 1, 3; Westmere 2.93 GHz 2x6 cores, Tesla 2x M2070, 96 GB memory]
SOL108, NVH
Exterior Acoustics simulation
• >5X speedup with 1 GPU over serial run
• ~2x speedup with 2 GPUs over 8 cores
EMEA customer model: structural & acoustic mesh with infinite elements; DOF: 0.165M
[Chart: speedup over serial run (0–8) for serial, 4c, 4c1g, 8c (dmp=2), 8c2g (dmp=2); 3 FREQ1 increments]
Server node: Westmere 2.93 GHz, 2x 6 core; Tesla 2x M2070 GPU, 96 GB memory
SOL108, NVH
Trimmed Car Body Frequency Response
USA automotive OEM model: Grids: 1.2M; Shells (CQUAD4): 1.04M; Solids (CTETRA): 0.1M; DOF: 7.47M
[Chart: elapsed time in minutes (0–250) for serial, 4c, 4c+1g, 8c (dmp=2 on 2 nodes), 8c+2g (dmp=2 on 2 nodes); speedups 1x, 2x, 3.7x, 3.7x, 5.6x; 3 FREQ1 increments, 5 load cases]
Server node: Westmere 2.93 GHz, 2x 6 core; Tesla 2x M2070 GPU, 96 GB memory
• ~4X speedup with 1 GPU over serial run
• 1.55x speedup with 2 GPUs over 8 cores
SOL108, NVH
Coupled Structural-Acoustics simulation
EMEA automotive OEM model: Grids: 0.71M; CTETRAs: 3.8M; DOF: 0.71M
[Chart: elapsed time in minutes (0–80) for serial, 4c (smp), 4c+1g, 8c (smp), 8c (dmp=2), 8c+2g (dmp=2); speedups 1x, 2x, 4x, 2.5x, 3.6x, 5.5x; FREQ1 limited to 5 frequency increments]
Server node: Westmere 2.93 GHz, 2x 6 core; Tesla 2x M2070 GPU, 96 GB memory
• 4X speedup with 1 GPU over serial run
• ~1.55x speedup with 2 GPUs over 8 cores
Conclusions
• GPGPU can give a very significant performance boost for large, direct-solver-intensive jobs, i.e., max front size > 10,000 for real data and > 5,000 for complex data models.
• Multiple-GPGPU performance is delivered by SOL108 (embarrassingly parallel) and by the matrix domain solver in SOL101.
• NVIDIA and MSC continue to work together to tune BLAS and LAPACK kernels for MSCLDL and MSCLU.
• Tree-level parallelism, using GPU memory to cache large sub-tree computations, and similar techniques provide a road map to even greater future performance.
• A number of other Nastran functional areas are candidates for GPU acceleration.
Acknowledgement
NVIDIA: Steve Harpster, Srinivas Kodiyalam, Stan Posey, Thomas Reed,
Steve Rennich, Peng Wang, Shucai Xiao.
MSC: Sunny Gogar, Joe Griffin, Dan’l Pierce, Peter Schartz, Alex TenEyck,
Paul Vanderwalt, Ted Wertheimer, Gaowen Ye, members of MSC
India QA team.