GPU Enhancements for Noise, Vibration and Harshness (NVH) Analysis

  • MSC Software Confidential

    GPU Enhancements for Noise, Vibration and Harshness (NVH) Analysis

    Dr. Ted Wertheimer

  • 20 Million DOF - 3.9 M elements

    3/20/2013

  • 20 Million DOF

    This model extracted many modes:

      up to 1500 Hz structure -> ~26500 modes

      up to 1500 Hz fluid -> ~3200 modes

    Large frequency range: 0 to 1024 Hz in 2048 frequency steps

    # Nodes   DMP * SMP   Elapsed Time
    4         16 * 4      4:58:09

  • 94 Million DOF

  • Automated Component Modal Synthesis (ACMS)

    MSC Nastran model is automatically divided into N domains

    Executes in parallel using Distributed Memory Parallel (DMP)

    Shared Memory Parallel (SMP) provides additional speedup

  • ACMS Domain Decomposition

    Example with DMP=4

    [Figure: domains numbered 0 to 30, assigned to four processes labeled Master, Slave 1, Slave 2 and Slave 3.]

  • MSC Nastran ACMS Automotive Models

    Multi-CPU, multi-core parallel scalability

    2X performance increase from 2010

    [Figure: bar chart for Case 1 to Case 4, each run serial and on 12 CPUs, comparing versions 2010, 2011.1, 2011.2 and 2012; vertical axis from 0 to 800.]

  • Nonsymmetric Solver Performance

    Up to 3X faster for exterior acoustics

    Exterior acoustics

    Brake squeal

    Friction

    Rotordynamics

    [Figure: bar chart for Case 3 (exterior acoustics), showing "fr resp" and "total job" for versions 2011.1, 2011.2 and 2012; vertical axis from 0 to 2000.]

  • Improved Performance for Acoustics

    Efficient Participation Factor

    3 Times Faster (MSC Nastran 2012 vs. MSC Nastran 2010)

  • MSC Nastran 2013

    Nastran direct equation solver is GPU accelerated

      Sparse direct factorization (MSCLDL, MSCLU)

      Real, Complex, Symmetric, Unsymmetric

    Handles very large fronts with minimal use of pinned host memory

      Lowest granularity GPU implementation of a sparse direct solver; solves unlimited sparse matrix sizes

    Impacts several solution sequences

      High impact (SOL101, SOL108), Mid (SOL103), Low (SOL111, SOL400)

  • MSC Nastran 2013

    Support of multi-GPU, for Linux and Windows

      With DMP > 1, multiple fronts are factorized concurrently on multiple GPUs; 1 GPU per matrix domain

    NVIDIA GPUs: Tesla K20/K20X, Tesla M2090, Tesla C2075, Quadro 6000

    CUDA 5.0

  • Direct sparse solver workflow in MSC Nastran (MSCLDL, MSCLU)

    In a proper order, do the following at each node:

      Assembly: from Global Stiffness & contribution blocks

      Pivoting

      Block factorization: diagonal decomposition, off-diagonal update, Schur complement (trailing matrix update)

    Most time-consuming matrix update operations on GPU

    [Figure: tree of nodes numbered 1 to 11.]
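The "proper order" in the workflow above can be read as a post-order walk of a tree of fronts, so that each child is factorized before its parent assembles the child's contribution block. The following is a minimal sketch under that reading, not MSC Nastran's implementation; the `Front` class, the function names and the 3-node tree are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Front:
    """One node of a hypothetical tree of fronts."""
    name: int
    children: list = field(default_factory=list)


def multifrontal_order(root):
    """Return front names in post-order: every child precedes its parent."""
    order = []

    def visit(node):
        for child in node.children:
            visit(child)
        order.append(node.name)

    visit(root)
    return order


def front_steps(root):
    """List (front, step, child fronts) tuples in the order they would run."""
    by_name = {}

    def index(node):
        by_name[node.name] = node
        for child in node.children:
            index(child)

    index(root)
    steps = []
    for name in multifrontal_order(root):
        kids = [c.name for c in by_name[name].children]
        steps.append((name, "assemble", kids))  # stiffness + child contribution blocks
        steps.append((name, "pivot", []))
        steps.append((name, "factorize", []))   # diagonal, off-diagonal, Schur update
    return steps


# Hypothetical tree: fronts 1 and 2 feed front 3.
tree = Front(3, [Front(1), Front(2)])
print(multifrontal_order(tree))
for step in front_steps(tree):
    print(step)
```

The sketch only records the order of operations; the numerical work inside each step is left out on purpose.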

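  • Block LU Decomposition

    Direct solves are (typically) performed using Block LU decomposition

    Spend most of their time computing the Schur Complement

    Compute bound / low hanging fruit

    $$
    \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
    =
    \begin{bmatrix} L_{11} & 0 \\ L_{21} & I \end{bmatrix}
    \begin{bmatrix} I & 0 \\ 0 & A_{22} - L_{21} U_{12} \end{bmatrix}
    \begin{bmatrix} U_{11} & U_{12} \\ 0 & I \end{bmatrix}
    $$

    Kernels: DPOTRF for $L_{11} U_{11} = A_{11}$; DTRSM for $L_{11} U_{12} = A_{12}$ and $L_{21} U_{11} = A_{21}$; DGEMM for the Schur complement $A_{22} - L_{21} U_{12}$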
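The block factorization on this slide can be checked on a small example. The sketch below is an illustration in plain Python, not the MSCLDL/MSCLU code: it uses an unpivoted LU of the diagonal block (the slide labels that step DPOTRF), two triangular solves in the role of DTRSM, and a matrix product in the role of DGEMM. All function names and the 4x4 test matrix are assumptions made for the example.

```python
def matmul(a, b):
    """Dense matrix product (the role DGEMM plays on the slide)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lu_nopivot(a):
    """Doolittle LU without pivoting: a = l @ u, with l unit lower triangular."""
    n = len(a)
    l = [[float(i == j) for j in range(n)] for i in range(n)]
    u = [[float(v) for v in row] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / u[k][k]
            for j in range(k, n):
                u[i][j] -= l[i][k] * u[k][j]
    return l, u


def solve_lower_unit(l, b):
    """Solve l @ x = b for x, with l unit lower triangular."""
    x = [[float(v) for v in row] for row in b]
    for i in range(len(x)):
        for k in range(i):
            for j in range(len(x[0])):
                x[i][j] -= l[i][k] * x[k][j]
    return x


def solve_upper_right(u, b):
    """Solve x @ u = b for x, with u upper triangular."""
    x = [[float(v) for v in row] for row in b]
    for row in x:
        for j in range(len(u)):
            for k in range(j):
                row[j] -= row[k] * u[k][j]
            row[j] /= u[j][j]
    return x


def block_lu(a11, a12, a21, a22):
    """One step of block LU; returns L11, U11, U12, L21 and the Schur complement."""
    l11, u11 = lu_nopivot(a11)          # L11 U11 = A11
    u12 = solve_lower_unit(l11, a12)    # L11 U12 = A12
    l21 = solve_upper_right(u11, a21)   # L21 U11 = A21
    prod = matmul(l21, u12)
    schur = [[a22[i][j] - prod[i][j] for j in range(len(a22[0]))]
             for i in range(len(a22))]  # A22 - L21 U12
    return l11, u11, u12, l21, schur


# Hypothetical 4x4 matrix split into 2x2 blocks.
a11 = [[4, 2], [2, 3]]
a12 = [[1, 0], [0, 1]]
a21 = [[1, 0], [0, 1]]
a22 = [[5, 1], [1, 4]]
l11, u11, u12, l21, schur = block_lu(a11, a12, a21, a22)
print(schur)
```

In a real solver the Schur complement becomes the next matrix to factorize, which is why its update dominates the run time for large fronts.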

  • PCIe limit on Schur complement calculation (DGEMM)

    PCIe limits GPU performance

    Host is faster for small fronts

    Requires nRank > 700 for full performance on K20

    M2090 and K20 perform the same until nRank > 300
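A back-of-the-envelope bound shows why a minimum nRank is needed before the GPU reaches full speed. The sketch assumes that an n x n trailing matrix of doubles crosses PCIe once per rank-k update (2·n²·k flops for 8·n² bytes) and ignores the panels; the bandwidth and peak figures below are placeholders, not measurements from the slides, and the function names are made up for this example.

```python
def attainable_gflops(n_rank, peak_gflops, pcie_gb_per_s, bytes_per_value=8):
    """Upper bound on the update rate when the trailing matrix streams over PCIe once."""
    # 2*n*n*k flops per bytes_per_value*n*n bytes transferred
    flops_per_byte = 2.0 * n_rank / bytes_per_value
    return min(peak_gflops, flops_per_byte * pcie_gb_per_s)


def min_rank_for_peak(peak_gflops, pcie_gb_per_s, bytes_per_value=8):
    """Smallest rank at which the PCIe bound no longer limits the update."""
    return peak_gflops * bytes_per_value / (2.0 * pcie_gb_per_s)


# Placeholder hardware figures (assumptions, not vendor data).
PEAK_GFLOPS = 1000.0
PCIE_GB_PER_S = 6.0

for k in (100, 300, 700, 1000):
    print(k, attainable_gflops(k, PEAK_GFLOPS, PCIE_GB_PER_S))
print(min_rank_for_peak(PEAK_GFLOPS, PCIE_GB_PER_S))
```

Under this model the attainable rate grows linearly with nRank until it hits the GPU's peak, so a faster GPU needs a larger rank before it pulls ahead of a slower one.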

  • MSC Nastran 2013: SMP + GPU acceleration of SOL101 and SOL103

    Speedup over serial (higher is better):

    Model                           serial   4c     4c+1g
    SOL101, 2.4M rows, 42K front    1X       2.7X   6X
    SOL103, 2.6M rows, 18K front    1X       1.9X   2.8X

    Server node: Sandy Bridge E5-2670 (2.6GHz), Tesla K20X GPU, 128 GB memory

    Lanczos solver (SOL 103):

      Sparse matrix factorization

      Iterate on a block of vectors (solve)

      Orthogonalization of vectors

  • NVH with MSC Nastran 2013: Coupled Structural-Acoustics simulation with SOL108

    Europe Auto OEM: 710K nodes, 3.83M elements

    100 frequency increments (FREQ1)

    Direct Sparse solver

    Server node: Sandy Bridge 2.6GHz, 2x 8 core, Tesla 2x K20X GPU, 128GB memory

    Elapsed time in minutes (lower is better), speedup relative to serial:

    Configuration      Speedup
    serial             1X
    1c + 1g            4.8X
    4c (smp)           2.7X
    4c + 1g            5.2X
    8c (dmp=2)         5.5X
    8c + 2g (dmp=2)    11.1X

  • MSC Nastran 2013: Solution Price-Performance Gain

  • NVH with MSC Nastran 2013: Trimmed Car Body Frequency Response with SOL108

    USA Auto OEM: 1.2M nodes, 7.47M DOF

    Shells (CQUAD4): 1.04M

    Solids (CTETRA): 0.1M

    100 frequency increments (FREQ1)

    Server node: Sandy Bridge 2.6GHz, 2x 8 core, Tesla 2x K20X GPU, 128GB memory

    Elapsed time in hours (lower is better), speedup relative to serial:

    Configuration            Speedup
    serial                   1X
    smp 4c                   2.5X
    smp 4c+1g (x1 node)      4.4X
    dmp 4c+1g (x2 nodes)     6.8X
    dmp 4c+1g (x3 nodes)     9X
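Scaling across nodes is possible here because each excitation frequency in a direct frequency response is an independent solve. The sketch below shows one way 100 frequencies could be divided among DMP processes; the slides do not state how MSC Nastran assigns frequencies, so the contiguous, near-equal split, the function name and the placeholder frequency values are all assumptions.

```python
def split_frequencies(freqs, n_dmp):
    """Split a list of frequencies into n_dmp contiguous, near-equal chunks."""
    base, extra = divmod(len(freqs), n_dmp)
    chunks, start = [], 0
    for rank in range(n_dmp):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks take one more
        chunks.append(freqs[start:start + size])
        start += size
    return chunks


# Placeholder list of 100 frequencies in Hz (values are illustrative only).
freqs = [float(i) for i in range(1, 101)]

for n_dmp in (1, 2, 3):
    sizes = [len(chunk) for chunk in split_frequencies(freqs, n_dmp)]
    print(n_dmp, sizes)
```

With three processes the split is 34, 33 and 33 frequencies, so the load imbalance stays within a single solve.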

  • NVH with MSC Nastran 2013: Engine Model Modal Frequency with SOL111

    Japan Auto OEM: Nodes 1.4M, Elements 0.78M

    Mainly TETRA10

    Modes: 104 (2500 Hz)

    Front size: 23,718

    [Figure: CPU Time. Stacked bars of time (sec.) for 1CPU (9052 sec.) and 1CPU+1GPU (5116 sec.); legend: FBS + Matrix-vector Multiply, Shift + Decomposition, Sparse Decomposition only; segment labels in extraction order: 2848, 1000, 614, 586, 2807, 901, 2303, 2168.]

    [Figure: Elapsed Time. Stacked bars of time (sec.) for 1CPU (9702 sec.) and 1CPU+1GPU (5647 sec.); legend: Pre_Eigenvalue, Eigenvalue, Resvec, Post_Eigenvalue; segment labels in extraction order: 335, 239, 2856, 1027, 6180, 4120, 291, 223.]

    1.7x speedup
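The 1.7x figure can be checked directly from the totals printed on the two charts. This is a small worked calculation using only those four numbers; the `speedup` helper is a name chosen for the example.

```python
def speedup(baseline_s, accelerated_s):
    """Ratio of baseline time to accelerated time."""
    return baseline_s / accelerated_s


# Totals taken from the slide, in seconds.
elapsed = speedup(9702, 5647)   # 1CPU vs. 1CPU+1GPU, elapsed time
cpu = speedup(9052, 5116)       # 1CPU vs. 1CPU+1GPU, CPU time

print(f"elapsed: {elapsed:.2f}x, CPU time: {cpu:.2f}x")
```

Both ratios round to roughly 1.7x to 1.8x, consistent with the speedup quoted on the slide.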

  • Marc 2012

    Marc multi-frontal sparse solver is GPU accelerated

      Marc Solver type 8

    Support of multi-GPU, for Linux and Windows

      Recommend 1 GPU per DDM domain

  • Marc 2012 GPU Acceleration

    Marc 2012 - Automotive Engine model (1M DOF)

    Customer model

    DOF: 1M

    Elements: 170K

    6.5X Speedup with 2 GPUs over Serial run

    [Figure: bar chart for Serial, 1c + 1gpu, nps=2, nps=2 with 2 gpus, and nps=4 with 2 gpus; vertical axis from 0 to 1800.]

  • Marc 2012 GPU Acceleration of US Auto OEM model

    2.5 Million Elements

    10 Million DOF

    Nonlinear Bolt Tightening

    48 Iterations

    [Figure: Speed Up End to End. Bar chart for Serial (1c), 4c and 1c+1 GPU; vertical axis from 0 to 3.]

  • Conclusions

    GPUs provide significant performance acceleration for large, direct-solver-intensive jobs, i.e. max front > 10000 for real data and > 5000 for complex data models.

    Multiple-GPU performance is available with DMP > 1, including for NVH SOL108 (embarrassingly parallel).

    NVIDIA and MSC continue to work together to tune BLAS and LAPACK kernels for MSCLDL and MSCLU.

    As models become larger, the value of GPGPU becomes greater.
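The front-size rule of thumb from the conclusions can be written as a small check. This is only a sketch of that rule; the function and constant names are hypothetical, and the thresholds are the two values quoted on the slide.

```python
REAL_FRONT_THRESHOLD = 10_000     # max front for real data, from the conclusions
COMPLEX_FRONT_THRESHOLD = 5_000   # max front for complex data, from the conclusions


def gpu_likely_beneficial(max_front, complex_data=False):
    """Apply the slide's rule of thumb for direct-solver-intensive jobs."""
    threshold = COMPLEX_FRONT_THRESHOLD if complex_data else REAL_FRONT_THRESHOLD
    return max_front > threshold


# Front sizes quoted elsewhere in the slides, plus one hypothetical small job.
print(gpu_likely_beneficial(42_000))                     # SOL101 model, 42K front
print(gpu_likely_beneficial(18_000))                     # SOL103 model, 18K front
print(gpu_likely_beneficial(23_718))                     # SOL111 engine model
print(gpu_likely_beneficial(4_000))                      # hypothetical small front
print(gpu_likely_beneficial(6_000, complex_data=True))   # hypothetical complex job
```

The check says nothing about how much speedup to expect; it only flags jobs whose largest front clears the quoted threshold.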

  • Thank You
