Stan Posey, NVIDIA, Santa Clara, CA, USA; [email protected]
GPU Progress of CAE Applications
2
NVIDIA HPC Technology and CAE Strategy
Technology:
- Development of professional GPUs as co-processing accelerators for x86 CPUs
- Software development: NVIDIA linear solver toolkit (implicit iterative solvers), CUDA libraries, GPU compilers
- GPU system integration: HP, Dell, IBM, Cray, SGI, Fujitsu, others; Kepler K20-based systems available since 2012
Strategy:
- Strategic alliances: business and technical collaboration with ISVs, industry customers, and research organizations
- Applications engineering: technical collaboration with ISVs (ANSYS, etc.) for development of GPU-accelerated solvers
3
GPU Product Summary for CAE Applications
NVIDIA Kepler family GPUs for CAE simulations: K20 (5 GB), K20X (6 GB), K40 (12 GB), K6000 (12 GB)
4
NVIDIA MAXIMUS: CAE Workstations Now Configure with 2 GPUs
- Visual computing (Quadro GPU): CAD operations, pre-processing, post-processing
- Parallel computing (Tesla GPU): FEA, CFD, and CEM solvers
- Intelligent GPU job allocation
- Unified driver for Quadro + Tesla
- ANSYS certifications
- Systems from HP, Dell, Xenon, others
- Now with Kepler-based GPUs; available since November 2011
5
NVIDIA GPUs Accelerate CAE at Any Scale
- Same GPU technology from MAXIMUS workstations to TITAN, #2 on Top500.org
- TITAN at ORNL: 20+ petaflops, 18,688 NVIDIA Tesla K20X GPUs
- Key application: S3D for turbulent combustion. How to efficiently burn next-generation diesel and bio fuels?
6
NVIDIA Use of CAE in Product Engineering
ANSYS Icepak – active and passive cooling of IC packages
ANSYS Mechanical – large deflection bending of PCBs
ANSYS Mechanical – comfort and fit of 3D emitter glasses
ANSYS Mechanical – shock & vib of solder ball assemblies
7
Higher fidelity (better models): GPUs permit higher fidelity within existing (CPU-only) job times
Parameter sensitivities (more models): GPUs increase throughput over existing (CPU-only) job capacity, and at lower cost
Advanced techniques: GPUs make practical high-order methods, time-dependent vs. static analysis, use of 3D solid finite elements vs. 2D shells, etc.
Larger ISV software budgets: GPUs provide more use of the existing ISV software investment
CAE Trends and GPU Acceleration Benefits
8
Strong GPU investments by commercial CAE vendors (ISVs); GPU adoption led by implicit FEA and CEM, followed by CFD
Recent CFD breakthroughs in linear solvers (AMG) and preconditioners
GPUs now production-HPC for leading CAE end-user sites, led by the automotive, electronics, and aerospace industries
GPUs contributing to fast growth in emerging CAE applications: new developments in particle-based CFD (LBM, SPH, DEM, etc.); rapid growth for a range of CEM applications and GPU adoption
Progress Summary for GPU-Parallel CAE
9
GPU Progress – Commercial CAE Software

Available today:
- Structural mechanics: ANSYS Mechanical; Abaqus/Standard; MSC Nastran; Marc; AFEA; NX Nastran; HyperWorks OptiStruct; PAM-CRASH implicit; LS-DYNA implicit
- Fluid dynamics: ANSYS CFD (Fluent); Moldflow; Culises (OpenFOAM); Particleworks; SpeedIT (OpenFOAM); AcuSolve
- Electromagnetics: EMPro; CST MWS; XFdtd; SEMCAD X; FEKO; Nexxim

Product and research evaluation:
- Structural mechanics: LS-DYNA; Abaqus/Explicit; RADIOSS; PAM-CRASH
- Fluid dynamics: Abaqus/CFD; LS-DYNA CFD; CFD++; FloEFD; STAR-CCM+; XFlow; CFD-ACE+
- Electromagnetics: JMAG; Xpatch; HFSS
10
Additional Commercial GPU Developments
ISV Domain Location Primary Applications
FluiDyna CFD Germany Culises for OpenFOAM; LBultra
Vratis CFD Poland Speed-IT for OpenFOAM; ARAEL
Prometech CFD Japan Particleworks
Turbostream CFD England, UK Turbostream
IMPETUS Explicit FEA Sweden AFEA
AVL CFD Austria FIRE
CoreTech CFD (molding) Taiwan Moldex3D
Intes Implicit FEA Germany PERMAS
Next Limit CFD Spain XFlow
CPFD CFD USA BARRACUDA
Convergent/IDAJ CFD USA Converge CFD
SCSK Implicit FEA Japan ADVENTURECluster
CDH Implicit FEA Germany AMLS; FastFRS
FunctionBay MB Dynamics S. Korea RecurDyn
Cradle Software CFD Japan SC/Tetra; scSTREAM
11
Every primary ISV has products available on GPUs or ongoing evaluation
The 4 largest ISVs all have products based on GPUs, some at the 3rd generation: ANSYS, SIMULIA, MSC Software, Altair
4 of the top 5 ISV applications are available on GPUs today: ANSYS Fluent, ANSYS Mechanical, Abaqus/Standard, MSC Nastran (LS-DYNA implicit only)
Several new ISVs were founded with GPUs as a primary competitive strategy: Prometech, FluiDyna, Vratis, IMPETUS, Turbostream
Availability of commercial CEM software is expanding with ECAE growth: CST, Remcom, Agilent, EMSS on 3rd-generation GPUs; JSOL to release JMAG, ANSYS to release HFSS
Status Summary of ISVs and GPU Acceleration
12
CAE Software Focus on Sparse Solvers
- Implicit sparse matrix operations account for roughly 40% to 75% of profile time but a small percentage of lines of code, so they are the first target for GPU acceleration
- These operations move to the GPU through hand-written CUDA, GPU-parallel libraries (e.g., cuBLAS), and OpenACC directives
- The surrounding phases stay on the CPU: reading input and matrix set-up before the solve, global solution and output writing after it
- OpenACC is being investigated for moving more tasks onto the GPU
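As an illustration of the directive approach listed above (not code from any ISV product), here is a minimal C sketch of a CSR sparse matrix-vector product offloaded with one OpenACC directive; the function and variable names are hypothetical.

/* Illustrative only: y = A*x for a CSR matrix, the kind of small,
   solver-level kernel that dominates profile time with few lines of code. */
void spmv_csr(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    int nnz = row_ptr[n];   /* number of stored non-zeros, used to size the data clauses */

    /* One directive offloads the loop; a plain C compiler ignores the pragma
       and runs the same loop on the CPU. */
    #pragma acc parallel loop copyin(row_ptr[0:n+1], col_idx[0:nnz], \
                                     val[0:nnz], x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += val[j] * x[col_idx[j]];
        y[i] = sum;
    }
}

int main(void)
{
    /* Tiny 2x2 check: A = [[4,1],[0,3]], x = [1,2] gives y = [6,6]. */
    int    row_ptr[] = {0, 2, 3};
    int    col_idx[] = {0, 1, 1};
    double val[]     = {4.0, 1.0, 3.0};
    double x[]       = {1.0, 2.0};
    double y[2];
    spmv_csr(2, row_ptr, col_idx, val, x, y);
    return (y[0] == 6.0 && y[1] == 6.0) ? 0 : 1;
}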
13
Most time consumed in dense matrix operations such as Cholesky factorization, Schur complement, and others
Method decomposes global stiffness matrix into tree of dense matrix fronts
Most CSM implementations send dense operations to the GPU while keeping the assembly tree traversal on the CPU
GPU Approach of Direct Solvers for Implicit CSM
14
GPU Approach of Direct Solvers for Implicit CSM
Typical implicit CSM deployment of multi-frontal sparse direct solvers (schematic: the stiffness matrix that is factorized by the direct solver):
- Large dense matrix fronts are factored on the GPU
- Small dense matrix fronts are factored in parallel on the CPU; more cores means higher performance
- Lower threshold: fronts too small to overcome the PCIe data transfer costs stay on the CPU cores
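A minimal sketch of that dispatch policy, assuming a Cholesky factorization of each symmetric positive-definite front: large fronts go to cuSOLVER on the GPU, small fronts stay on the CPU via LAPACK. The threshold value and function name are illustrative, not taken from any ISV solver, and error checking is omitted.

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>
#include <lapacke.h>

#define GPU_FRONT_THRESHOLD 2048   /* illustrative front size (order of the dense block) */

/* Factor one dense front stored column-major in A (n x n). Returns 0 on success. */
int factor_front(double *A, int n)
{
    if (n < GPU_FRONT_THRESHOLD) {
        /* Small front: PCIe transfer would dominate, keep it on the CPU cores. */
        return LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, A, n);
    }

    /* Large front: copy to the GPU, factor with cuSOLVER, copy the factor back. */
    cusolverDnHandle_t handle;
    double *dA, *dWork;
    int *dInfo, lwork, info = 0;

    cusolverDnCreate(&handle);
    cudaMalloc((void **)&dA, sizeof(double) * n * n);
    cudaMalloc((void **)&dInfo, sizeof(int));
    cudaMemcpy(dA, A, sizeof(double) * n * n, cudaMemcpyHostToDevice);

    cusolverDnDpotrf_bufferSize(handle, CUBLAS_FILL_MODE_LOWER, n, dA, n, &lwork);
    cudaMalloc((void **)&dWork, sizeof(double) * lwork);
    cusolverDnDpotrf(handle, CUBLAS_FILL_MODE_LOWER, n, dA, n, dWork, lwork, dInfo);

    cudaMemcpy(A, dA, sizeof(double) * n * n, cudaMemcpyDeviceToHost);
    cudaMemcpy(&info, dInfo, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dWork); cudaFree(dInfo); cudaFree(dA);
    cusolverDnDestroy(handle);
    return info;   /* 0 on success, as with LAPACK dpotrf */
}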
15
CAE Priority for ISV Software on GPUs
- CFD: ANSYS / ANSYS Fluent; OpenFOAM (various ISVs); CD-adapco / STAR-CCM+; Autodesk Simulation CFD; ESI / CFD-ACE+; SIMULIA / Abaqus/CFD
- Implicit FEA: ANSYS / ANSYS Mechanical; SIMULIA / Abaqus/Standard; MSC Software / MSC Nastran; MSC Software / Marc; LSTC / LS-DYNA implicit; Altair / RADIOSS Bulk; Siemens / NX Nastran; Autodesk / Mechanical
- Explicit FEA: LSTC / LS-DYNA; SIMULIA / Abaqus/Explicit; Altair / RADIOSS; ESI / PAM-CRASH
- Also prioritized: ANSYS / ANSYS Mechanical; Altair / RADIOSS; Altair / AcuSolve (CFD); Autodesk / Moldflow
(Groups are ranked #1 through #4 by priority.)
16
Basics of GPU Computing for ISV Software
- ISV software use of GPU acceleration is user-transparent: jobs launch and complete without additional user steps
- The user simply informs the ISV application (via GUI or command) that a GPU exists
- Offload model for an x86 CPU with an attached GPU accelerator (the CPU begins and ends the job, the GPU handles the heavy computations):
  1. ISV job is launched on the CPU
  2. Solver operations are sent to the GPU
  3. The GPU sends results back to the CPU
  4. The ISV job completes on the CPU
[Schematic: CPU with cache and DDR memory connected through the I/O hub and PCI-Express to the GPU and its GDDR memory; the numbers 1-4 mark the steps above.]
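The same four steps in miniature, as a hedged host-code sketch using the CUDA runtime and cuBLAS, with a simple vector update standing in for the solver operations (ISV applications hide all of this from the user):

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 4;
    double x[] = {1, 2, 3, 4}, y[] = {10, 20, 30, 40}, alpha = 2.0;
    double *dX, *dY;
    cublasHandle_t handle;

    /* 1. Job begins on the CPU: allocate and set up data in host memory. */
    cublasCreate(&handle);
    cudaMalloc((void **)&dX, n * sizeof(double));
    cudaMalloc((void **)&dY, n * sizeof(double));

    /* 2. Solver operands are sent to the GPU over PCI-Express. */
    cublasSetVector(n, sizeof(double), x, 1, dX, 1);
    cublasSetVector(n, sizeof(double), y, 1, dY, 1);

    /* Heavy computation runs on the GPU (here: y = alpha*x + y). */
    cublasDaxpy(handle, n, &alpha, dX, 1, dY, 1);

    /* 3. GPU sends results back to the CPU. */
    cublasGetVector(n, sizeof(double), dY, 1, y, 1);

    /* 4. Job completes on the CPU. */
    printf("y[0] = %g (expected 12)\n", y[0]);
    cudaFree(dX); cudaFree(dY);
    cublasDestroy(handle);
    return 0;
}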
17
Computational Fluid Dynamics
ANSYS Fluent
18
ANSYS and NVIDIA Collaboration Roadmap

Release 13.0 (Dec 2010)
- ANSYS Mechanical: SMP, single GPU, sparse and PCG/JCG solvers
- ANSYS EM: ANSYS Nexxim

Release 14.0 (Dec 2011)
- ANSYS Mechanical: + Distributed ANSYS; + multi-node support
- ANSYS Fluent: radiation heat transfer (beta)
- ANSYS EM: ANSYS Nexxim

Release 14.5 (Nov 2012)
- ANSYS Mechanical: + multi-GPU support; + hybrid PCG; + Kepler GPU support
- ANSYS Fluent: + radiation heat transfer; + GPU AMG solver (beta), single GPU
- ANSYS EM: ANSYS Nexxim

Release 15.0 (Dec 2013)
- ANSYS Mechanical: + CUDA 5 Kepler tuning
- ANSYS Fluent: + multi-GPU AMG solver; + CUDA 5 Kepler tuning
- ANSYS EM: ANSYS Nexxim; ANSYS HFSS (Transient)
19
ANSYS 15.0 HPC License Scheme for GPUs
- Treats each GPU socket as a CPU core, which significantly increases the simulation productivity of your HPC licenses
- Needs 1 HPC task to enable a GPU
- All ANSYS HPC products unlock GPUs in 15.0, including HPC, HPC Pack, HPC Workgroup, and HPC Enterprise products
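For example, a single HPC Pack (8 parallel tasks) can drive a job on 7 CPU cores plus 1 GPU, with the GPU consuming the eighth task; this is the '7 CPU cores + 1 GPU' configuration used in the ANSYS Mechanical benchmarks later in this deck.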
20
ANSYS Fluent Profile for Coupled PBNS Solver
Runtime breakdown per non-linear iteration:
- Assemble the linear system of equations: ~35%
- Solve the linear system of equations Ax = b: ~65% (accelerate this first)
The non-linear iterations repeat until convergence, then the run stops.
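With the linear solve at ~65% of runtime, Amdahl's law bounds the overall gain: accelerating only the solver by a factor S gives an end-to-end speedup of 1 / (0.35 + 0.65/S), at most about 2.9x even for an arbitrarily large S. This is why the solver is accelerated first, and why end-to-end speedups are always smaller than solver-only speedups.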
21
Overview of AmgX Linear Solver Library
Two forms of AMG:
- Classical AMG, as in HYPRE: strong convergence, scalar systems
- Un-smoothed aggregation AMG: lower setup times, handles block systems
Krylov methods: GMRES, CG, BiCGStab, with preconditioned and 'flexible' variants
Classic iterative methods: block-Jacobi, Gauss-Seidel, Chebyshev, ILU0, ILU1, with multi-colored versions for fine-grained parallelism
Flexible configuration: all methods can serve as solvers, preconditioners, or smoothers, with nesting
Designed for non-linear problems: allows frequently changing matrices, with parallel and efficient setup
22
AmgX Developed for Ease-of-Use
- No CUDA experience necessary to use the library
- C API: links with C, C++, or Fortran; small, focused API
- Reads common matrix formats (CSR, COO, MM)
- Single-GPU and multi-GPU
- Interoperates easily with MPI, OpenMP, and hybrid parallel applications
- Tuned for K20 and K40; supports Fermi and newer
- Single and double precision
- Supported on Linux and Win64
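For illustration, a hedged sketch of what a call into AmgX can look like from application code, based on the public amgx_c.h C API; the configuration string is only an example of the kind of setting the library accepts (the exact syntax is documented in the AmgX reference), and error checking is omitted.

#include <amgx_c.h>

void solve_with_amgx(int n, int nnz, const int *row_ptr, const int *col_idx,
                     const double *values, const double *b, double *x)
{
    AMGX_config_handle    cfg;
    AMGX_resources_handle rsrc;
    AMGX_matrix_handle    A;
    AMGX_vector_handle    rhs, sol;
    AMGX_solver_handle    solver;

    AMGX_initialize();

    /* Solver configuration: e.g. FGMRES preconditioned with AMG (illustrative string). */
    AMGX_config_create(&cfg, "config_version=2, solver=FGMRES, preconditioner=AMG, "
                             "max_iters=100, tolerance=1e-8");
    AMGX_resources_create_simple(&rsrc, cfg);

    /* dDDI = data on the device, double matrix, double vectors, int indices. */
    AMGX_matrix_create(&A,   rsrc, AMGX_mode_dDDI);
    AMGX_vector_create(&rhs, rsrc, AMGX_mode_dDDI);
    AMGX_vector_create(&sol, rsrc, AMGX_mode_dDDI);
    AMGX_solver_create(&solver, rsrc, AMGX_mode_dDDI, cfg);

    /* Upload the CSR system (scalar case: 1x1 blocks, no external diagonal). */
    AMGX_matrix_upload_all(A, n, nnz, 1, 1, row_ptr, col_idx, values, NULL);
    AMGX_vector_upload(rhs, n, 1, b);
    AMGX_vector_upload(sol, n, 1, x);       /* initial guess */

    /* Setup builds the AMG hierarchy; solve runs the Krylov iteration. */
    AMGX_solver_setup(solver, A);
    AMGX_solver_solve(solver, rhs, sol);
    AMGX_vector_download(sol, x);

    AMGX_solver_destroy(solver);
    AMGX_vector_destroy(sol);
    AMGX_vector_destroy(rhs);
    AMGX_matrix_destroy(A);
    AMGX_resources_destroy(rsrc);
    AMGX_config_destroy(cfg);
    AMGX_finalize();
}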
23
How to Enable NVIDIA GPUs in ANSYS Fluent
Command-line example (Linux shown):
fluent 3ddp -g -ssh -t2 -gpgpu=1 -i journal.jou
Cluster specification:
- nprocs = total number of Fluent processes
- M = number of machines
- ngpgpus = number of GPUs per machine
Requirement 1: nprocs mod M = 0 (same number of solver processes on each machine)
Requirement 2: (nprocs / M) mod ngpgpus = 0 (the number of processes per machine must be an integer multiple of the GPUs per machine)
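For example, nprocs = 32 Fluent processes on M = 2 machines with ngpgpus = 4 GPUs per machine satisfies both requirements: 32 mod 2 = 0 (16 processes per machine) and 16 mod 4 = 0, so each GPU serves 4 solver processes, matching the 4-cores-per-GPU ratio recommended on the next slide.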
24
Considerations for ANSYS Fluent on GPUs
- GPUs accelerate the AMG solver of the CFD analysis; fine meshes and low-dissipation problems have a high %AMG, and the coupled solution scheme spends 65% of runtime in AMG on average
- In many cases, pressure-based coupled solvers offer faster convergence than segregated solvers (problem-dependent)
- The system matrix must fit in GPU memory; for coupled PBNS, each 1 million cells needs about 4 GB of GPU memory, so high-memory GPUs such as Tesla K40 or Quadro K6000 are ideal
- Better performance comes with lower CPU core counts; a ratio of 4 CPU cores to 1 GPU is recommended
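As a rough sizing check, at about 4 GB per million cells the 111-million-cell truck case on the following slides needs on the order of 440 GB of GPU memory, which is why it runs across 48 Tesla K40 GPUs (12 GB each, 576 GB total).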
25
ANSYS Fluent GPU Performance for Large Cases
ANSYS Fluent 15.0 performance; results by NVIDIA, Dec 2013
Truck body model: external aerodynamics; steady, k-epsilon turbulence; double-precision solver
Hardware: Intel Xeon E5-2667 CPUs, 12 cores per node; Tesla K40 GPUs, 4 per node
ANSYS Fluent time per iteration in seconds (lower is better):
- 14 million cells: 36 CPU cores, 13 s; 36 CPU cores + 12 GPUs, 9.5 s (1.4x)
- 111 million cells: 144 CPU cores, 36 s; 144 CPU cores + 48 GPUs, 18 s (2x)
26
ANSYS Fluent GPU Performance for Large Cases
ANSYS Fluent 15.0 performance; results by NVIDIA, Dec 2013
Truck body model: 111 million mixed cells; external aerodynamics; steady, k-epsilon turbulence; double-precision solver
Hardware: Intel Xeon E5-2667 CPUs, 12 cores per node; Tesla K40 GPUs, 4 per node
Times per iteration in seconds (lower is better):
- AMG solver time: 144 CPU cores (AMG), 29 s; 48 GPUs (AmgX), 11 s (2.7x)
- Fluent solution time: 144 CPU cores, 36 s; 144 CPU cores + 48 GPUs, 18 s (2x)
The AMG solver accounts for about 80% of the solution time in this case.
NOTE: AmgX is a linear solver toolkit from NVIDIA, used by ANSYS.
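These numbers are consistent with Amdahl's law: with AMG at roughly 80% of the solution time, a 2.7x AMG speedup gives 1 / (0.2 + 0.8/2.7), which is about 2x overall.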
27
ANSYS Fluent GPU Study on Productivity Gains
ANSYS Fluent 15.0 Preview 3 performance; results by NVIDIA, Sep 2013
Truck body model: 14 million mixed cells; steady, k-epsilon turbulence; coupled PBNS, double precision; total solution times; all results fully converged
Solvers: AMG F-cycle on CPU; FGMRES with AMG preconditioner on GPU
ANSYS Fluent jobs per day (higher is better): 2 nodes x 2 CPUs (32 cores total) + 8 GPUs (4 per node) delivers the same 16 jobs per day as 4 nodes x 2 CPUs (64 cores total)
- Same solution times: 64 cores vs. 32 cores + 8 GPUs
- Frees up 32 CPU cores and their HPC licenses for additional job(s)
- Approximately 56% increase in overall productivity for a 25% increase in cost
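One way to read the 56%: the GPU configuration matches the 64-core job time with only 32 cores, so the other 32 cores and their licenses can run further jobs concurrently; since a 32-core run takes roughly twice as long as a 64-core run, total throughput rises by roughly half, in line with the reported gain for an estimated 25% increase in hardware cost.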
28
Computational Fluid Dynamics
OpenFOAM
29
NVIDIA Development Strategy for OpenFOAM
- Provide technical support for commercial GPU solver developments: the FluiDyna Culises library through NVIDIA collaboration on AMG; the Vratis Speed-IT library, with development of CUSP-based AMG
- Invest in alliances (but not development) with key OpenFOAM organizations: ESI and the OpenCFD Foundation (H. Weller, M. Salari); Wikki and the OpenFOAM-extend community (H. Jasak); IDAJ Japan and ICON UK, which support both OF and OF-ext
- Conduct performance studies and customer benchmark evaluations, in collaboration with developers, customers, and OEMs (Dell, SGI, HP, etc.)
30
Culises: CFD Solver Library for OpenFOAM
FluiDyna: TU Munich spin-off from 2006 (www.fluidyna.de)
- Culises provides a linear solver library
- Culises requires only two edits to an OpenFOAM control file
- Multi-GPU ready
- Contact FluiDyna for license details
Culises easy-to-use AMG-PCG solver:
1. Download and license from http://www.FluiDyna.de
2. Automatic installation with a FluiDyna-provided script
3. Activate Culises and GPUs with 2 edits to the config file
[Figure: side-by-side comparison of the CPU-only and CPU+GPU config files.]
31
OpenFOAM Speedups Based on CFD Application
GPU speedups for different industry cases, relative to OpenFOAM CPU-only (www.fluidyna.de):
- Automotive: 1.6x
- Multiphase: 1.9x
- Thermal: 3.0x
- Pharma CFD: 2.2x
- Process CFD: 4.7x
Range of model sizes and different solver schemes (Krylov, AMG-PCG, etc.)
32
FluiDyna Culises: CFD Solver for OpenFOAM
DrivAer: joint car body shape by BMW and Audi (http://www.aer.mw.tum.de/en/research-groups/automotive/drivaer)
- Up to 36 million cells (mixed type); GAMG on CPU, AMGPCG on GPU
- Solver speedup of 7x for 2 CPUs + 4 GPUs

Mesh size and CPUs    GPUs added    Solver speedup    Job speedup
9M cells, 2 CPUs      +1 GPU        2.5x              1.36x
18M cells, 2 CPUs     +2 GPUs       4.2x              1.52x
36M cells, 2 CPUs     +4 GPUs       6.9x              1.67x

Reference: Culises: A Library for Accelerated CFD on Hybrid GPU-CPU Systems, Dr. Bjoern Landmann, FluiDyna, developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0293-GTC2012-Culises-Hybrid-GPU.pdf (www.fluidyna.de)
33
Computational Structural Mechanics
ANSYS Mechanical
34
CSM Model Feature Recommendations for GPUs
- Model should be at least 500 KDOF; more is better. This ensures enough computational work to justify use of a GPU
- Models with solid finite elements will speed up more than shell elements; there is generally not enough computational work in 2D shell elements
- Direct solvers: moderate GPU memory and heavy system memory. System memory needs capacity for the entire system matrix (in-core); GPU memory needs capacity for a single matrix front
- Iterative solvers: large GPU memory and moderate system memory. GPU memory needs capacity for the entire system matrix (in-core)
35
ANSYS and NVIDIA Collaboration Roadmap

Release 13.0 (Dec 2010)
- ANSYS Mechanical: SMP, single GPU, sparse and PCG/JCG solvers
- ANSYS EM: ANSYS Nexxim

Release 14.0 (Dec 2011)
- ANSYS Mechanical: + Distributed ANSYS; + multi-node support
- ANSYS Fluent: radiation heat transfer (beta)
- ANSYS EM: ANSYS Nexxim

Release 14.5 (Nov 2012)
- ANSYS Mechanical: + multi-GPU support; + hybrid PCG; + Kepler GPU support
- ANSYS Fluent: + radiation heat transfer; + GPU AMG solver (beta), single GPU
- ANSYS EM: ANSYS Nexxim

Release 15.0 (Dec 2013)
- ANSYS Mechanical: + CUDA 5 Kepler tuning
- ANSYS Fluent: + multi-GPU AMG solver; + CUDA 5 Kepler tuning
- ANSYS EM: ANSYS Nexxim; ANSYS HFSS (Transient)
36
ANSYS Mechanical 15.0 on Tesla GPUs
V14sp-5 model: turbine geometry; 2,100,000 DOF; SOLID187 elements; static, nonlinear; Distributed ANSYS 15.0; direct sparse solver
System: Distributed ANSYS Mechanical 15.0 with Intel Xeon E5-2697 v2 2.7 GHz CPU; Tesla K20 and Tesla K40 GPUs with boost clocks
ANSYS Mechanical jobs per day (higher is better):
Simulation productivity with an HPC license:
- 2 CPU cores: 93
- 2 CPU cores + Tesla K20: 324 (3.5x)
- 2 CPU cores + Tesla K40: 363 (3.9x)
Simulation productivity with an HPC Pack:
- 8 CPU cores: 275
- 7 CPU cores + Tesla K20: 576 (2.1x)
- 7 CPU cores + Tesla K40: 600 (2.2x)
37
ANSYS Mechanical 15.0 on Tesla K40
V14sp-6 model: 4,900,000 DOF; static, nonlinear; Distributed ANSYS 15.0; direct sparse solver
System: Distributed ANSYS Mechanical 15.0 with Intel Xeon E5-2697 v2 2.7 GHz CPU and a Tesla K40 GPU with boost clocks
ANSYS Mechanical jobs per day (higher is better):
Simulation productivity with an HPC license:
- 2 CPU cores: 59
- 2 CPU cores + Tesla K40: 172 (2.9x)
Simulation productivity with an HPC Pack:
- 8 CPU cores: 180
- 7 CPU cores + Tesla K40: 315 (1.8x)
38
Computational Structural Mechanics
Abaqus/Standard
39
Abaqus 6.11, June 2011
Direct sparse solver is accelerated on the GPU
Single GPU support; Fermi GPUs (Tesla 20-series, Quadro 6000)
Abaqus 6.12, June 2012
Multi-GPU/node; multi-node DMP clusters
Flexibility to run jobs on specific GPUs
Fermi GPUs + Kepler Hotfix (since November 2012)
Abaqus 6.13, June 2013
Un-symmetric sparse solver on GPU
Official Kepler support (Tesla K20/K20X)
SIMULIA and Abaqus GPU Release Progression
40
Rolls Royce: Abaqus 3.5x Speedup with 5M DOF
Model: 4.71M DOF (equations); ~77 TFLOPs; nonlinear static (6 steps); direct sparse solver, 100 GB memory
Single server, Sandy Bridge + Tesla K20X: 2x E5-2670 2.6 GHz CPUs, 128 GB memory, 2x Tesla K20X, Linux RHEL 6.2, Abaqus/Standard 6.12-2
Configurations tested: 8 cores; 8 cores + 1 GPU; 8 cores + 2 GPUs; 16 cores; 16 cores + 2 GPUs
Speedups relative to 8 cores (1x): the GPU-accelerated 8-core runs reached roughly 2.1x and 2.4x, and the best configuration (16 cores + 2 GPUs) reached 3.5x
41
Rolls Royce: Abaqus Speedups on an HPC Cluster
Model: 4.71M DOF (equations); ~77 TFLOPs; nonlinear static (6 steps); direct sparse solver, 100 GB memory
Servers, Sandy Bridge + Tesla K20X, up to 4 servers: each with 2x E5-2670 2.6 GHz CPUs, 128 GB memory, 2x Tesla K20X, Linux RHEL 6.2, Abaqus/Standard 6.12-2
Elapsed-time speedups from adding GPUs (lower elapsed time is better):
- 2 servers: 24 cores vs. 24 cores + 4 GPUs, about 2.2x
- 3 servers: 36 cores vs. 36 cores + 6 GPUs, about 1.9x
- 4 servers: 48 cores vs. 48 cores + 8 GPUs, about 1.8x
42
Computational Structural Mechanics
MSC Nastran
43
MSC Nastran Direct Equation Solver is GPU accelerated
Sparse direct factorization with no limit on model size
Real, Complex, Symmetric, Un-symmetric
Impacts several solution sequences: high impact (SOL101, SOL108), mid (SOL103), low (SOL111, SOL400)
Support of multi-GPU and for Linux and Windows
NVIDIA GPUs include Tesla 20-series, Tesla K20/K20X, Quadro 6000
MSC Nastran Release 2013 for GPUs
44
MSC Nastran 2013 and GPU Performance: SMP + GPU acceleration of SOL101 and SOL103
Server node: Sandy Bridge E5-2670 (2.6 GHz), Tesla K20X GPU, 128 GB memory
Speedups relative to serial (higher is better):
- SOL101 (2.4M rows, 42K front): serial 1x; 4 cores 2.7x; 4 cores + 1 GPU 6x
- SOL103 (2.6M rows, 18K front): serial 1x; 4 cores 1.9x; 4 cores + 1 GPU 2.8x
Lanczos solver (SOL103) work: sparse matrix factorization; iteration on a block of vectors (solve); orthogonalization of vectors
45
MSC Nastran 2013 and NVH Simulation on GPUs: coupled structural-acoustics simulation with SOL108
Model: European auto OEM; 710K nodes, 3.83M elements; 100 frequency increments (FREQ1); direct sparse solver
Server node: Sandy Bridge 2.6 GHz, 2x 8 cores, 2x Tesla K20X GPUs, 128 GB memory
Approximate speedups in elapsed time relative to serial (lower elapsed time is better):
- serial: 1x
- 1 core + 1 GPU: 4.8x
- 4 cores (SMP): 2.7x
- 4 cores + 1 GPU: 5.2x
- 8 cores (DMP=2): 5.5x
- 8 cores + 2 GPUs (DMP=2): 11.1x
46
Computational Structural Mechanics
Altair OptiStruct
47
GPU Performance of OptiStruct PCG Solver
Problem: hood of a car with pressure loads, displacements, and stresses; 2.2 million degrees of freedom, 62 million non-zeros; 380,000 shells + 13,000 solids + 1,100 RBE3; 5,300 iterations
Platform: NVIDIA PSG cluster, 2 nodes, each with dual NVIDIA M2090 GPUs (CUDA v3.2), Intel Westmere 2x 6-core X5670 @ 2.93 GHz, Linux RHEL 5.4 with Intel MPI 4.0
Elapsed times (lower is better):
- SMP 6-core: 1106 s (baseline)
- Hybrid 2 MPI x 6 SMP: 572 s
- SMP 6-core + 1 GPU: 254 s (4.3x)
- Hybrid 2 MPI x 6 SMP + 2 GPUs: 143 s (7.5x)
- Hybrid 4 MPI x 6 SMP (2 nodes): 306 s
- Hybrid 4 MPI x 6 SMP + 4 GPUs (2 nodes): 85 s (13x)
Summary: 2 GPUs on 1 node gives 7.5x; 4 GPUs on 2 nodes gives 13x
48
Summary of GPU Progress for CAE
- GPUs provide significant speedups for solver-intensive simulations: improved product quality through higher-fidelity modeling, and shorter product engineering cycles through faster simulation turnaround
- Simulations recently considered impractical are now possible: FEA with larger DOF counts, more complex material behavior, and FSI; CFD with unsteady RANS and LES practical in cost and time; effective parameter optimization from a large increase in the number of jobs