Page 1: GPU Progress of CAE Applications

Stan Posey, NVIDIA, Santa Clara, CA, USA; sposey@nvidia.com

GPU Progress of CAE Applications

Page 2:

Development of professional GPUs as co-processing accelerators for x86 CPUs

Strategic Alliances: Business and technical collaboration with ISVs, industry customers, and research organizations

Applications Engineering: Technical collaboration with ISVs (ANSYS, etc.) for development of GPU-accelerated solvers

Software Development: NVIDIA linear solver toolkit (implicit iterative solvers), CUDA libraries, GPU compilers

GPU System Integration: HP, Dell, IBM, Cray, SGI, Fujitsu, others; Kepler K20-based systems available since 2012

NVIDIA HPC Technology and CAE Strategy

Page 3:

NVIDIA Kepler family GPUs for CAE simulations

K20 (5 GB), K20X (6 GB), K40 (12 GB), Quadro K6000 (12 GB)

GPU Product Summary for CAE Applications

Page 4:

CAE Workstations Now Configure with 2 GPUs

NVIDIA® MAXIMUS: parallel computing + visual computing in one workstation, covering CAD operations, pre-processing, FEA / CFD / CEM solution, and post-processing

Intelligent GPU job allocation

Unified driver for Quadro + Tesla

ANSYS certifications

HP, Dell, Xenon, others

Now Kepler-based GPUs; available since November 2011

Page 5:

NVIDIA GPUs Accelerate CAE at Any Scale

TITAN at ORNL: 20+ PetaFlops, 18,688 NVIDIA Tesla K20X

Same GPU technology from MAXIMUS workstations to TITAN, #2 at Top500.org

Key application: S3D for turbulent combustion. How to efficiently burn next-generation diesel and bio fuels?

Page 6:

NVIDIA Use of CAE in Product Engineering

ANSYS Icepak – active and passive cooling of IC packages

ANSYS Mechanical – large deflection bending of PCBs

ANSYS Mechanical – comfort and fit of 3D emitter glasses

ANSYS Mechanical – shock and vibration of solder ball assemblies

Page 7:

Higher fidelity (better models): GPUs permit higher fidelity within existing (CPU-only) job times

Parameter sensitivities (more models): GPUs increase throughput over existing (CPU-only) job capacity, and at lower cost

Advanced techniques: GPUs make practical high-order methods, time-dependent vs. static analysis, use of 3D solid finite elements vs. 2D shells, etc.

Larger ISV software budgets: GPUs provide more use of existing ISV software investment

CAE Trends and GPU Acceleration Benefits

Page 8:

Strong GPU investments by commercial CAE vendors (ISVs): GPU adoption led by implicit FEA and CEM, followed by CFD

Recent CFD breakthroughs in linear solvers (AMG) and preconditioners

GPUs now production-HPC for leading CAE end-user sites, led by the automotive, electronics, and aerospace industries

GPUs contributing to fast growth in emerging CAE applications: new developments in particle-based CFD (LBM, SPH, DEM, etc.) and rapid growth for a range of CEM applications

Progress Summary for GPU-Parallel CAE

Page 9:

GPU Progress – Commercial CAE Software

Available Today
  Structural Mechanics: ANSYS Mechanical, Abaqus/Standard, MSC Nastran, Marc, AFEA, NX Nastran, HyperWorks OptiStruct, PAM-CRASH implicit, LS-DYNA implicit
  Fluid Dynamics: ANSYS CFD (FLUENT), Moldflow, Culises (OpenFOAM), Particleworks, SpeedIT (OpenFOAM), AcuSolve
  Electromagnetics: EMPro, CST MWS, XFdtd, SEMCAD X, FEKO, Nexxim

Product Evaluation
  Structural Mechanics: LS-DYNA, Abaqus/Explicit, RADIOSS, PAM-CRASH
  Fluid Dynamics: Abaqus/CFD, LS-DYNA CFD
  Electromagnetics: Xpatch, HFSS

Research Evaluation
  Fluid Dynamics: CFD++, FloEFD, STAR-CCM+, XFlow
  Other: JMAG, CFD-ACE+

Page 10:

Additional Commercial GPU Developments

ISV              Domain         Location     Primary Applications
FluiDyna         CFD            Germany      Culises for OpenFOAM; LBultra
Vratis           CFD            Poland       Speed-IT for OpenFOAM; ARAEL
Prometech        CFD            Japan        Particleworks
Turbostream      CFD            England, UK  Turbostream
IMPETUS          Explicit FEA   Sweden       AFEA
AVL              CFD            Austria      FIRE
CoreTech         CFD (molding)  Taiwan       Moldex3D
Intes            Implicit FEA   Germany      PERMAS
Next Limit       CFD            Spain        XFlow
CPFD             CFD            USA          BARRACUDA
Convergent/IDAJ  CFD            USA          Converge CFD
SCSK             Implicit FEA   Japan        ADVENTURECluster
CDH              Implicit FEA   Germany      AMLS; FastFRS
FunctionBay      MB Dynamics    S. Korea     RecurDyn
Cradle Software  CFD            Japan        SC/Tetra; scSTREAM

Page 11:

Every primary ISV has products available on GPUs or under ongoing evaluation

The 4 largest ISVs all have products based on GPUs, some at 3rd generation: ANSYS, SIMULIA, MSC Software, Altair

The top 4 of 5 ISV applications are available on GPUs today: ANSYS Fluent, ANSYS Mechanical, Abaqus/Standard, MSC Nastran (LS-DYNA implicit only)

Several new ISVs were founded with GPUs as a primary competitive strategy: Prometech, FluiDyna, Vratis, IMPETUS, Turbostream

Availability of commercial CEM software expanding with ECAE growth: CST, Remcom, Agilent, EMSS on 3rd-gen; JSOL to release JMAG, ANSYS to release HFSS

Status Summary of ISVs and GPU Acceleration

Page 12:

CAE Software Focus on Sparse Solvers

CAE application software runs on the CPU with GPU acceleration added through hand-written CUDA, GPU libraries (CUBLAS), and OpenACC directives.

Read input, matrix set-up (CPU)

Implicit sparse matrix operations (GPU): 40% - 75% of profile time, small % of lines of code

Global solution, write output (CPU)

(Investigating OpenACC for more tasks on GPU)

Page 13:

Most time consumed in dense matrix operations such as Cholesky factorization, Schur complement, and others

Method decomposes global stiffness matrix into tree of dense matrix fronts

Most CSM implementations send dense operations to the GPU while keeping the assembly tree traversal on the CPU

GPU Approach of Direct Solvers for Implicit CSM

Page 14:

GPU Approach of Direct Solvers for Implicit CSM

Typical implicit CSM deployment of multi-frontal sparse direct solvers:

Large dense matrix fronts are factored on the GPU.

Small dense matrix fronts are factored in parallel on the CPU; more cores means higher performance.

Lower threshold: fronts too small to overcome PCIe data transfer costs stay on the CPU cores.

[Schematic: representation of the stiffness matrix that is factorized by the direct solver]
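
The dispatch rule above can be made concrete with a small sketch in C. This is our illustration under stated assumptions (the threshold constant and function names are hypothetical), not any ISV's actual solver code:

#include <stdio.h>
#include <stddef.h>

/* Sketch of the front-dispatch rule for multi-frontal direct solvers:
   large dense fronts go to the GPU, small fronts stay on the CPU cores.
   The threshold and function names are illustrative only. */

#define GPU_FRONT_THRESHOLD (512u * 512u)   /* assumed tuning constant */

typedef struct { size_t rows, cols; } Front;

static void factor_front_cpu(const Front *f)
{   /* stand-in for a multi-threaded CPU dense factorization */
    printf("CPU : %zux%zu front\n", f->rows, f->cols);
}

static void factor_front_gpu(const Front *f)
{   /* stand-in for a cuBLAS/cuSOLVER-style GPU factorization */
    printf("GPU : %zux%zu front\n", f->rows, f->cols);
}

static void factor_front(const Front *f)
{
    /* Dense factorization work grows ~O(n^3) while PCIe traffic grows
       ~O(n^2), so only sufficiently large fronts amortize the transfer. */
    if (f->rows * f->cols >= GPU_FRONT_THRESHOLD)
        factor_front_gpu(f);
    else
        factor_front_cpu(f);
}

int main(void)
{
    Front small = {128, 128}, large = {4096, 2048};
    factor_front(&small);  /* stays on CPU */
    factor_front(&large);  /* sent to GPU  */
    return 0;
}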

Page 15:

CAE Priority for ISV Software on GPUs

#1: ANSYS / ANSYS Fluent; OpenFOAM (various ISVs); CD-adapco / STAR-CCM+; Autodesk Simulation CFD; ESI / CFD-ACE+; SIMULIA / Abaqus/CFD

#2: ANSYS / ANSYS Mechanical; SIMULIA / Abaqus/Standard; MSC Software / MSC Nastran; MSC Software / Marc; LSTC / LS-DYNA implicit; Altair / RADIOSS Bulk; Siemens / NX Nastran; Autodesk / Mechanical

#3: ANSYS / ANSYS Mechanical; Altair / RADIOSS; Altair / AcuSolve (CFD); Autodesk / Moldflow

#4: LSTC / LS-DYNA; SIMULIA / Abaqus/Explicit; Altair / RADIOSS; ESI / PAM-CRASH

Page 16:

Basics of GPU Computing for ISV Software

ISV software use of GPU acceleration is user-transparent: jobs launch and complete without additional user steps; the user simply informs the ISV application (GUI, command) that a GPU exists.

Schematic of an x86 CPU with an attached GPU accelerator (CPU begins/ends the job, GPU manages the heavy computations; CPU with DDR memory connects to GPU with GDDR memory over PCI-Express through the I/O hub):

1. ISV job launched on CPU
2. Solver operations sent to GPU
3. GPU sends results back to CPU
4. ISV job completes on CPU

Page 17:

Computational Fluid Dynamics

ANSYS Fluent

Page 18:

ANSYS and NVIDIA Collaboration Roadmap

Release 13.0, Dec 2010
  ANSYS Mechanical: SMP, single GPU, sparse and PCG/JCG solvers
  ANSYS EM: ANSYS Nexxim

Release 14.0, Dec 2011
  ANSYS Mechanical: + Distributed ANSYS; + multi-node support
  ANSYS Fluent: radiation heat transfer (beta)
  ANSYS EM: ANSYS Nexxim

Release 14.5, Nov 2012
  ANSYS Mechanical: + multi-GPU support; + hybrid PCG; + Kepler GPU support
  ANSYS Fluent: + radiation HT; + GPU AMG solver (beta), single GPU
  ANSYS EM: ANSYS Nexxim

Release 15.0, Dec 2013
  ANSYS Mechanical: + CUDA 5 Kepler tuning
  ANSYS Fluent: + multi-GPU AMG solver; + CUDA 5 Kepler tuning
  ANSYS EM: ANSYS Nexxim; ANSYS HFSS (Transient)

Page 19:

ANSYS 15.0 HPC License Scheme for GPUs

Treats each GPU socket as a CPU core, which significantly increases the simulation productivity of your HPC licenses.

Needs 1 HPC task to enable a GPU.

All ANSYS HPC products unlock GPUs in 15.0, including HPC, HPC Pack, HPC Workgroup, and HPC Enterprise products.

Page 20:

ANSYS Fluent Profile for Coupled PBNS Solver

Runtime profile per non-linear iteration: assemble the linear system of equations (~35%), then solve the linear system Ax = b (~65%); loop until converged, then stop.

Accelerate the ~65% linear solve first.

Page 21:

Overview of AmgX Linear Solver Library

Two forms of AMG: classical AMG, as in HYPRE, strong convergence, scalar; un-smoothed aggregation AMG, lower setup times, handles block systems

Krylov methods: GMRES, CG, BiCGStab, preconditioned and 'flexible' variants

Classic iterative methods: Block-Jacobi, Gauss-Seidel, Chebyshev, ILU0, ILU1; multi-colored versions for fine-grained parallelism

Flexible configuration: all methods usable as solvers, preconditioners, or smoothers; nesting

Designed for non-linear problems: allows for a frequently changing matrix; parallel and efficient setup

Page 22:

AmgX Developed for Ease-of-Use

No CUDA experience necessary to use the library
C API: links with C, C++, or Fortran
Small, focused API
Reads common matrix formats (CSR, COO, MM)
Single GPU and multi-GPU
Interoperates easily with MPI, OpenMP, and hybrid parallel applications
Tuned for K20 & K40; supports Fermi and newer
Single and double precision
Supported on Linux, Win64
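
As a concrete illustration of the C API, below is a minimal single-GPU sketch assembled from public AmgX entry points (AMGX_initialize, AMGX_config_create, AMGX_matrix_upload_all, AMGX_solver_setup/solve). The FGMRES-with-AMG configuration string is illustrative and should be checked against the AmgX reference:

#include <stdio.h>
#include <amgx_c.h>

int main(void)
{
    /* Solve a tiny 2x2 diagonal system A x = b on a single GPU.
       Mode dDDI: device matrix, double values, int indices. */
    int n = 2, nnz = 2;
    int    row_ptrs[] = {0, 1, 2};
    int    col_inds[] = {0, 1};
    double vals[]     = {4.0, 4.0};
    double rhs[]      = {1.0, 1.0};
    double sol[2]     = {0.0, 0.0};

    AMGX_config_handle    cfg;
    AMGX_resources_handle rsrc;
    AMGX_matrix_handle    A;
    AMGX_vector_handle    x, b;
    AMGX_solver_handle    solver;

    AMGX_initialize();
    /* Illustrative configuration: FGMRES preconditioned by AMG,
       as in the Fluent results later in this deck. */
    AMGX_config_create(&cfg,
        "config_version=2, solver=FGMRES, preconditioner=AMG, "
        "max_iters=100, tolerance=1e-8");
    AMGX_resources_create_simple(&rsrc, cfg);

    AMGX_matrix_create(&A, rsrc, AMGX_mode_dDDI);
    AMGX_vector_create(&x, rsrc, AMGX_mode_dDDI);
    AMGX_vector_create(&b, rsrc, AMGX_mode_dDDI);
    AMGX_solver_create(&solver, rsrc, AMGX_mode_dDDI, cfg);

    /* Upload the CSR matrix and vectors, then setup and solve. */
    AMGX_matrix_upload_all(A, n, nnz, 1, 1, row_ptrs, col_inds, vals, NULL);
    AMGX_vector_upload(b, n, 1, rhs);
    AMGX_vector_set_zero(x, n, 1);
    AMGX_solver_setup(solver, A);
    AMGX_solver_solve(solver, b, x);
    AMGX_vector_download(x, sol);
    printf("x = [%g, %g]\n", sol[0], sol[1]);   /* expect [0.25, 0.25] */

    AMGX_solver_destroy(solver);
    AMGX_vector_destroy(x);
    AMGX_vector_destroy(b);
    AMGX_matrix_destroy(A);
    AMGX_resources_destroy(rsrc);
    AMGX_config_destroy(cfg);
    AMGX_finalize();
    return 0;
}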

Page 23:

How to Enable NVIDIA GPUs in ANSYS Fluent

Command-line example (Windows and Linux):

fluent 3ddp -g -ssh -t2 -gpgpu=1 -i journal.jou

Cluster specification:

nprocs  = total number of Fluent processes
M       = number of machines
ngpgpus = number of GPUs per machine

Requirement 1: nprocs mod M = 0
(same number of solver processes on each machine)

Requirement 2: (nprocs / M) mod ngpgpus = 0
(the number of processes per machine must be an integer multiple of the GPUs per machine)
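
The two launch requirements are simple divisibility rules; a minimal sketch in C (the function names are ours; Fluent itself enforces these rules at startup):

#include <stdio.h>

/* Check the two cluster launch rules above. */
int gpu_layout_is_valid(int nprocs, int machines, int gpus_per_machine)
{
    if (nprocs % machines != 0)                    /* Requirement 1 */
        return 0;
    int procs_per_machine = nprocs / machines;
    if (procs_per_machine % gpus_per_machine != 0) /* Requirement 2 */
        return 0;
    return 1;
}

int main(void)
{
    /* 32 processes on 4 machines with 4 GPUs each: valid */
    printf("%d\n", gpu_layout_is_valid(32, 4, 4)); /* prints 1 */
    /* 30 processes on 4 machines: fails Requirement 1 */
    printf("%d\n", gpu_layout_is_valid(30, 4, 4)); /* prints 0 */
    return 0;
}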

Page 24:

Considerations for ANSYS Fluent on GPUs

• GPUs accelerate the AMG solver of the CFD analysis
  – Fine meshes and low-dissipation problems have a high %AMG
  – The coupled solution scheme spends 65% on average in AMG

• In many cases, pressure-based coupled solvers offer faster convergence than segregated solvers (problem-dependent)

• The system matrix must fit in GPU memory
  – For coupled PBNS, each 1 million cells needs about 4 GB of GPU memory
  – High-memory GPUs such as Tesla K40 or Quadro K6000 are ideal

• Better performance with lower CPU core counts
  – A ratio of 4 CPU cores to 1 GPU is recommended
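
A worked example of the sizing rule (the ~4 GB per million cells constant is the slide's figure; the rest is our arithmetic):

#include <stdio.h>

/* Estimate GPU count from the coupled PBNS sizing rule above. */
int main(void)
{
    const double gb_per_million_cells = 4.0;  /* slide's rule of thumb */
    const double gpu_memory_gb = 12.0;        /* Tesla K40 / Quadro K6000 */
    double cells_million = 14.0;              /* e.g. the 14M-cell truck case */

    double need_gb = cells_million * gb_per_million_cells;
    int gpus = (int)((need_gb + gpu_memory_gb - 1.0) / gpu_memory_gb);
    printf("%.0fM cells need ~%.0f GB => at least %d x 12 GB GPUs\n",
           cells_million, need_gb, gpus);   /* 14M cells: ~56 GB, 5 GPUs */
    return 0;
}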

Page 25:

ANSYS Fluent GPU Performance for Large Cases

ANSYS Fluent 15.0 Performance – Results by NVIDIA, Dec 2013

Truck Body Model
• External aerodynamics
• Steady, k-e turbulence
• Double-precision solver
• CPU: Intel Xeon E5-2667; 12 cores per node
• GPU: Tesla K40, 4 per node

14 million cells: 36 CPU cores, 13 sec; 36 CPU cores + 12 GPUs, 9.5 sec (1.4X)
111 million cells: 144 CPU cores, 36 sec; 144 CPU cores + 48 GPUs, 18 sec (2X)

NOTE: Reported times are sec per iteration; lower is better

Page 26:

ANSYS Fluent GPU Performance for Large Cases

ANSYS Fluent 15.0 Performance – Results by NVIDIA, Dec 2013

Truck Body Model
• 111M mixed cells
• External aerodynamics
• Steady, k-e turbulence
• Double-precision solver
• CPU: Intel Xeon E5-2667; 12 cores per node
• GPU: Tesla K40, 4 per node

AMG solver time per iteration: 144 CPU cores (AMG), 29 sec; 48 GPUs (AmgX), 11 sec (2.7X)
Fluent solution time per iteration: 144 CPU cores, 36 sec; 144 CPU cores + 48 GPUs, 18 sec (2X)

AMG is ~80% of the solution time; lower is better.

NOTE: AmgX is a linear solver toolkit from NVIDIA, used by ANSYS

Page 27:

ANSYS Fluent GPU Study on Productivity Gains

ANSYS Fluent 15.0 Preview 3 Performance – Results by NVIDIA, Sep 2013

Truck Body Model
14M mixed cells; steady, k-e turbulence; coupled PBNS, double precision; total solution times
CPU: AMG F-cycle; GPU: FGMRES with AMG preconditioner

Configurations compared (ANSYS Fluent jobs per day, higher is better; both deliver ~16 jobs per day):
• 4 nodes x 2 CPUs (64 cores total)
• 2 nodes x 2 CPUs (32 cores total) + 8 GPUs (4 per node)

• Same solution times: 64 cores vs. 32 cores + 8 GPUs
• Frees up 32 CPU cores and HPC licenses for additional job(s)
• Approximately 56% increase in overall productivity for a 25% increase in cost

NOTE: All results fully converged
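
One way to reproduce the ~56% figure, assuming the chart's ~16 jobs/day per configuration and ~9 additional jobs/day from the freed 32 cores (our reconstruction, not stated on the slide):

#include <stdio.h>

/* Throughput arithmetic behind the productivity claim above. */
int main(void)
{
    double base_jobs  = 16.0;  /* 64 CPU cores                          */
    double gpu_jobs   = 16.0;  /* 32 CPU cores + 8 GPUs, same job time  */
    double freed_jobs = 9.0;   /* assumed extra jobs on freed 32 cores  */

    double gain = (gpu_jobs + freed_jobs) / base_jobs - 1.0;
    printf("throughput gain: %.0f%%\n", gain * 100.0);  /* ~56%% */
    return 0;
}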

Page 28:

Computational Fluid Dynamics

OpenFOAM

Page 29:

Provide technical support for commercial GPU solver developments:
FluiDyna Culises library, through NVIDIA collaboration on AMG
Vratis Speed-IT library, development of CUSP-based AMG

Invest in alliances (but not development) with key OpenFOAM organizations:
ESI and the OpenCFD Foundation (H. Weller, M. Salari)
Wikki and the OpenFOAM-extend community (H. Jasak)
IDAJ Japan and ICON UK, supporting both OF and OF-ext

Conduct performance studies and customer benchmark evaluations:
Collaborations with developers, customers, and OEMs (Dell, SGI, HP, etc.)

NVIDIA Development Strategy for OpenFOAM

Page 30:

Culises: CFD Solver Library for OpenFOAM

FluiDyna: TU Munich spin-off from 2006 (www.fluidyna.de)

Culises provides a linear solver library; requires only two edits to an OpenFOAM control file; multi-GPU ready; contact FluiDyna for license details

Culises easy-to-use AMG-PCG solver:
1. Download and license from http://www.FluiDyna.de
2. Automatic installation with a FluiDyna-provided script
3. Activate Culises and GPUs with 2 edits to the config file (CPU-only config vs. CPU+GPU config)

Page 31:

OpenFOAM Speedups Based on CFD Application

GPU speedups for different industry cases (www.fluidyna.de):

Automotive: 1.6x
Multiphase: 1.9x
Thermal: 3.0x
Pharma CFD: 2.2x
Process CFD: 4.7x

Range of model sizes and different solver schemes (Krylov, AMG-PCG, etc.)

Page 32:

FluiDyna Culises: CFD Solver for OpenFOAM

DrivAer: joint car body shape by BMW and Audi
http://www.aer.mw.tum.de/en/research-groups/automotive/drivaer

• 36M cells (mixed type)
• GAMG on CPU
• AMGPCG on GPU

Solver speedup of 7x for 2 CPUs + 4 GPUs:

9M cells, 2 CPUs + 1 GPU: 2.5x solver speedup, 1.36x job speedup
18M cells, 2 CPUs + 2 GPUs: 4.2x solver speedup, 1.52x job speedup
36M cells, 2 CPUs + 4 GPUs: 6.9x solver speedup, 1.67x job speedup

"Culises: A Library for Accelerated CFD on Hybrid GPU-CPU Systems", Dr. Bjoern Landmann, FluiDyna; developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0293-GTC2012-Culises-Hybrid-GPU.pdf (www.fluidyna.de)

Page 33:

Computational Structural Mechanics

ANSYS Mechanical

Page 34:

CSM Model Feature Recommendations for GPUs

Model should be 500 KDOF or greater; more is better: ensures enough computational work to justify use of a GPU

Models with solid FEs will speed up more than shell FEs: generally not enough computational work in 2D shell elements

Direct solvers: moderate GPU memory and heavy system memory; system memory needs capacity for the entire system matrix (in-core), GPU memory needs capacity for a single matrix front

Iterative solvers: large GPU memory and moderate system memory; GPU memory needs capacity for the entire system matrix (in-core)
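
These sizing rules can be expressed as a pre-flight check. A sketch in C, with the 500 KDOF floor taken from the slide; the struct fields, thresholds, and example values are our illustration, not an ISV feature:

#include <stdio.h>

typedef struct {
    long   dof;               /* model size in degrees of freedom    */
    double matrix_gb;         /* full system matrix size, GB         */
    double largest_front_gb;  /* largest dense front, GB (direct)    */
} Model;

const char *gpu_recommendation(const Model *m, double gpu_gb,
                               double sys_gb, int iterative)
{
    if (m->dof < 500000)
        return "model too small to justify a GPU";
    if (iterative)  /* iterative: whole matrix must fit on the GPU */
        return (m->matrix_gb <= gpu_gb)
             ? "iterative solver on GPU" : "matrix exceeds GPU memory";
    /* direct: matrix in host memory, one front at a time on the GPU */
    if (m->matrix_gb > sys_gb)
        return "system memory too small for in-core direct solve";
    return (m->largest_front_gb <= gpu_gb)
         ? "direct solver with GPU fronts" : "largest front exceeds GPU memory";
}

int main(void)
{
    /* illustrative values, loosely sized on a ~2.1 MDOF case */
    Model m = {2100000, 40.0, 3.0};
    puts(gpu_recommendation(&m, 12.0, 128.0, 0));
    return 0;
}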

Page 35:

ANSYS and NVIDIA Collaboration Roadmap

Release 13.0, Dec 2010
  ANSYS Mechanical: SMP, single GPU, sparse and PCG/JCG solvers
  ANSYS EM: ANSYS Nexxim

Release 14.0, Dec 2011
  ANSYS Mechanical: + Distributed ANSYS; + multi-node support
  ANSYS Fluent: radiation heat transfer (beta)
  ANSYS EM: ANSYS Nexxim

Release 14.5, Nov 2012
  ANSYS Mechanical: + multi-GPU support; + hybrid PCG; + Kepler GPU support
  ANSYS Fluent: + radiation HT; + GPU AMG solver (beta), single GPU
  ANSYS EM: ANSYS Nexxim

Release 15.0, Dec 2013
  ANSYS Mechanical: + CUDA 5 Kepler tuning
  ANSYS Fluent: + multi-GPU AMG solver; + CUDA 5 Kepler tuning
  ANSYS EM: ANSYS Nexxim; ANSYS HFSS (Transient)

Page 36:

ANSYS Mechanical 15.0 on Tesla GPUs

V14sp-5 Model: turbine geometry, 2,100,000 DOF, SOLID187 FEs, static nonlinear, Distributed ANSYS 15.0, direct sparse solver

ANSYS Mechanical jobs/day (higher is better):

Simulation productivity with an HPC license:
2 CPU cores: 93
2 CPU cores + Tesla K20: 324 (3.5X)
2 CPU cores + Tesla K40: 363 (3.9X)

Simulation productivity with an HPC Pack:
8 CPU cores: 275
7 CPU cores + Tesla K20: 576 (2.1X)
7 CPU cores + Tesla K40: 600 (2.2X)

Distributed ANSYS Mechanical 15.0 with Intel Xeon E5-2697 v2 2.7 GHz CPU; Tesla K20 GPU and Tesla K40 GPU with boost clocks.

Page 37:

ANSYS Mechanical 15.0 on Tesla K40

V14sp-6 Model: 4,900,000 DOF, static nonlinear, Distributed ANSYS 15.0, direct sparse solver

ANSYS Mechanical jobs/day (higher is better):

Simulation productivity with an HPC license:
2 CPU cores: 59
2 CPU cores + Tesla K40: 172 (2.9X)

Simulation productivity with an HPC Pack:
8 CPU cores: 180
7 CPU cores + Tesla K40: 315 (1.8X)

Distributed ANSYS Mechanical 15.0 with Intel Xeon E5-2697 v2 2.7 GHz CPU and a Tesla K40 GPU with boost clocks.

Page 38:

Computational Structural Mechanics

Abaqus/Standard

Page 39:

Abaqus 6.11, June 2011

Direct sparse solver is accelerated on the GPU

Single GPU support; Fermi GPUs (Tesla 20-series, Quadro 6000)

Abaqus 6.12, June 2012

Multi-GPU/node; multi-node DMP clusters

Flexibility to run jobs on specific GPUs

Fermi GPUs + Kepler Hotfix (since November 2012)

Abaqus 6.13, June 2013

Un-symmetric sparse solver on GPU

Official Kepler support (Tesla K20/K20X)

SIMULIA and Abaqus GPU Release Progression

Page 40:

Rolls Royce: Abaqus 3.5x Speedup with 5M DOF

• 4.71M DOF (equations); ~77 TFLOPs
• Nonlinear static (6 steps)
• Direct sparse solver, 100 GB memory
Sandy Bridge + Tesla K20X, single server

Configurations (elapsed time in seconds, speedup relative to 8 cores): 8 cores; 8 cores + 1 GPU; 8 cores + 2 GPUs; 16 cores; 16 cores + 2 GPUs. GPU-accelerated runs reached 2.11x and 2.42x, up to 3.5x overall.

Server with 2x E5-2670 2.6 GHz CPUs, 128 GB memory, 2x Tesla K20X, Linux RHEL 6.2, Abaqus/Standard 6.12-2

Page 41:

Rolls Royce: Abaqus Speedups on an HPC Cluster

• 4.71M DOF (equations); ~77 TFLOPs
• Nonlinear static (6 steps)
• Direct sparse solver, 100 GB memory
Sandy Bridge + Tesla K20X for 4 servers

Elapsed time (lower is better), GPU speedup per configuration:
2 servers: 24 cores vs. 24 cores + 4 GPUs, 2.2x
3 servers: 36 cores vs. 36 cores + 6 GPUs, 1.9x
4 servers: 48 cores vs. 48 cores + 8 GPUs, 1.8x

Servers with 2x E5-2670 2.6 GHz CPUs, 128 GB memory, 2x Tesla K20X per server, Linux RHEL 6.2, Abaqus/Standard 6.12-2

Page 42:

Computational Structural Mechanics

MSC Nastran

Page 43:

MSC Nastran direct equation solver is GPU accelerated:
Sparse direct factorization with no limit on model size
Real, complex, symmetric, and un-symmetric matrices

Impacts several solution sequences: high impact (SOL101, SOL108), mid (SOL103), low (SOL111, SOL400)

Support for multi-GPU, on Linux and Windows
NVIDIA GPUs include Tesla 20-series, Tesla K20/K20X, Quadro 6000

MSC Nastran Release 2013 for GPUs

Page 44:

MSC Nastran 2013 and GPU Performance
SMP + GPU acceleration of SOL101 and SOL103 (speedup, higher is better)

SOL101, 2.4M rows, 42K front: serial 1X; 4 cores 2.7X; 4 cores + 1 GPU 6X
SOL103, 2.6M rows, 18K front: serial 1X; 4 cores 1.9X; 4 cores + 1 GPU 2.8X

Lanczos solver (SOL103): sparse matrix factorization; iterate on a block of vectors (solve); orthogonalization of vectors

Server node: Sandy Bridge E5-2670 (2.6 GHz), Tesla K20X GPU, 128 GB memory

Page 45:

MSC Nastran 2013 and NVH Simulation on GPUs
Coupled structural-acoustics simulation with SOL108 (elapsed time in minutes, lower is better)

Europe Auto OEM: 710K nodes, 3.83M elements, 100 frequency increments (FREQ1), direct sparse solver

serial: 1X
1 core + 1 GPU: 4.8X
4 cores (SMP): 2.7X
4 cores + 1 GPU: 5.2X
8 cores (DMP=2): 5.5X
8 cores + 2 GPUs (DMP=2): 11.1X

Server node: Sandy Bridge 2.6 GHz, 2x 8 cores, 2x Tesla K20X GPUs, 128 GB memory

Page 46:

Computational Structural Mechanics

Altair OptiStruct

Page 47:

GPU Performance of OptiStruct PCG Solver

Elapsed times in seconds, single node (speedups relative to the SMP 6-core baseline):
SMP 6-core: 1106
Hybrid 2 MPI x 6 SMP: 572
SMP 6 + 1 GPU: 254 (4.3X)
Hybrid 2 MPI x 6 SMP + 2 GPUs: 143 (7.5X)

Elapsed times in seconds, two nodes:
Hybrid 4 MPI x 6 SMP: 306
Hybrid 4 MPI x 6 SMP + 4 GPUs: 85 (13X)

2 x GPU on 1 node: 7.5X; 4 x GPU on 2 nodes: 13X

Problem: hood of a car with pressure loads, displacements, and stresses
Benchmark: 2.2 million degrees of freedom, 62 million nonzeros
380,000 shells + 13,000 solids + 1,100 RBE3
5,300 iterations

Platform: NVIDIA PSG cluster, 2 nodes, each with dual NVIDIA M2090 GPUs (CUDA v3.2), Intel Westmere 2x 6-core X5670 @ 2.93 GHz, Linux RHEL 5.4 with Intel MPI 4.0

Page 48:

GPUs provide significant speedups for solver-intensive simulations:
Improved product quality with higher fidelity modeling
Shortened product engineering cycles with faster simulation turnaround

Simulations recently considered impractical are now possible:
FEA: larger-DOF models, more complex material behavior, FSI
CFD: unsteady RANS and LES simulations practical in cost and time
Effective parameter optimization from a large increase in the number of jobs

Summary of GPU Progress for CAE

Page 49:

Stan Posey, NVIDIA, Santa Clara, CA, USA; sposey@nvidia.com

Thank You, and Questions?