
  • Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

    U N C L A S S I F I E D

    Slide 1

    Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE

    S. Mniszewski*, M. Cawkwell, A. Niklasson

    GPU Technology Conference

    San Jose, California

    March 18-21, 2013

    *[email protected]

    Slide 2

    Background: Quantum Molecular Dynamics

    •  In a molecular dynamics simulation, the relative positions of atoms evolve over a series of time steps according to the force acting on each atom

    •  Employed in materials science, chemistry, and biology to study structures, defects, and equilibrium and non-equilibrium phenomena

    •  Forces and energies are calculated from an interatomic potential

    •  Quantum-based models capture the making and breaking of covalent bonds, charge transfer between species of differing electronegativities, and long-range electrostatic interactions
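The time-stepping loop in the first bullet is common to all MD, quantum or classical. As a minimal sketch, here is one widely used integrator, velocity Verlet, in pure Python; this is an illustration only (LATTE itself is Fortran 90), and the 1-D harmonic force is a stand-in for a real interatomic potential:

```python
# Velocity-Verlet integration of a single 1-D particle (unit mass).
# The accel callback stands in for forces from an interatomic
# potential; here it is a toy harmonic well, a(r) = -r.

def velocity_verlet(r, v, accel, dt, steps):
    a = accel(r)
    for _ in range(steps):
        r = r + v * dt + 0.5 * a * dt * dt   # position update
        a_new = accel(r)                     # forces at the new positions
        v = v + 0.5 * (a + a_new) * dt       # velocity update
        a = a_new
    return r, v

r, v = velocity_verlet(1.0, 0.0, lambda x: -x, dt=0.01, steps=1000)
# total energy (r^2 + v^2)/2 stays close to its initial value of 0.5
```

Velocity Verlet is popular in MD because it is symplectic: the energy drift stays bounded over long trajectories instead of accumulating.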


    Quantum-based Interatomic Potentials

    Slide 3

    •  The electronic structure of atoms and molecules is modeled explicitly

    •  These potentials give the most accurate and reliable descriptions of interatomic bonding

    •  Their prohibitive computational cost has prevented widespread use – better algorithms and GPU architectures are important paths forward

    Energy:  E = 2 Tr[ρH]

    Force on atom i:  fᵢ = −2 Tr[ρ ∂H/∂Rᵢ]

    [Figure: time per MD time step vs. number of atoms, comparing quantum MD with improved algorithms and GPUs to empirical pair potentials]

    •  H is the Hamiltonian matrix

    •  The density matrix ρ is computed self-consistently from H
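With ρ in hand, the energy expression above is a single trace of a matrix product. A pure-Python sketch for a toy two-orbital Hamiltonian (illustrative values, not LATTE data):

```python
# E = 2 Tr[rho H]: Tr[A B] needs only the diagonal of the product,
# so it can be accumulated without forming A*B in full.

def trace_of_product(A, B):
    n = len(A)
    return sum(A[i][j] * B[j][i] for i in range(n) for j in range(n))

H = [[0.0, -1.0],
     [-1.0, 0.0]]            # toy two-orbital Hamiltonian, levels at -1 and +1
rho = [[0.5, 0.5],
       [0.5, 0.5]]           # projector onto the bonding (lowest) state
E = 2.0 * trace_of_product(rho, H)   # two electrons in the -1 level: E = -2
```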


    The Density Matrix Computation

    •  Algorithms typically used in quantum-based models, most notably matrix diagonalization, are not ideally suited to GPUs because of
       •  their complexity
       •  the difficulty of extracting thread-level parallelism
       •  the difficulty of avoiding branching within warps

    •  New approach in LATTE
       •  The density matrix is computed directly from the Hamiltonian through a recursive expansion of the Fermi operator with the second order spectral projection (SP2) algorithm
       •  Based on a series of generalized matrix-matrix multiplications
       •  Only one matrix-matrix multiplication is required per iteration
       •  Maps very well to GPUs

    Slide 4


    The Second Order Spectral Projection Algorithm (SP2) – Reduced Complexity

    Slide 5

    [Figure: the SP2 projection polynomials f(x) = x² and f(x) = 2x − x² on [0, 1]; the composed map f₈(f₇(…f₁(x)…)) approaches a step function]

    ρ = θ(μI − H) = lim_{i→∞} fᵢ[fᵢ₋₁[… f₀[X₀]…]]

    X₀ = (εmax·I − H) / (εmax − εmin)

    fᵢ[Xᵢ] = Xᵢ²          if 2Tr[Xᵢ] ≥ Nₑ
           = 2Xᵢ − Xᵢ²    if 2Tr[Xᵢ] < Nₑ

    Recursive Fermi Operator expansion
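The recursion above can be written in a few lines. Below is a pure-Python sketch of SP2 for a tiny symmetric Hamiltonian; it is an illustration only (LATTE runs this on dense Fortran arrays with CUBLAS), and the stopping test on |2 Tr[X] − Nₑ| is a simplification of a production convergence criterion:

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def sp2(H, n_elec, eps_min, eps_max, tol=1e-10, max_iter=100):
    """Density matrix rho = theta(mu*I - H) via the SP2 recursion."""
    n = len(H)
    # X0 = (eps_max*I - H) / (eps_max - eps_min): spectrum mapped into [0, 1]
    X = [[((eps_max if i == j else 0.0) - H[i][j]) / (eps_max - eps_min)
          for j in range(n)] for i in range(n)]
    for _ in range(max_iter):
        X2 = matmul(X, X)            # the one matrix multiply per iteration
        t, t2 = trace(X), trace(X2)
        # pick X^2 or 2X - X^2, whichever drives 2 Tr[X] toward n_elec
        if abs(2.0 * t2 - n_elec) < abs(2.0 * (2.0 * t - t2) - n_elec):
            X = X2
        else:
            X = [[2.0 * X[i][j] - X2[i][j] for j in range(n)]
                 for i in range(n)]
        if abs(2.0 * trace(X) - n_elec) < tol:
            break
    return X

# Three-orbital chain with two electrons; eigenvalue bounds chosen loosely.
H = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
rho = sp2(H, n_elec=2, eps_min=-1.5, eps_max=1.5)
```

The returned ρ is idempotent (ρ² = ρ) and satisfies 2 Tr[ρ] = Nₑ, which is exactly the error measure plotted later in the deck.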


    The GPU Implementation

    •  Part of the LATTE codebase
       •  Employs a semi-empirical tight-binding model of interatomic bonding based on formalisms derived from density functional theory
       •  The density matrix build is by far the slowest step in the calculation
       •  CPU version in Fortran 90

    •  Hardware/Software Architecture
       •  Keeneland* cluster at the National Institute for Computational Sciences
       •  CPU: 2 Intel hex-core Xeon CPUs per node, Intel Fortran Compiler, MKL
       •  3 Nvidia M2090 GPUs (previously M2070 GPUs)
       •  CUDA 4.2, CUBLAS, and a thread block size of 1024

    •  Use of CUDA features on GPUs
       •  Unified Virtual Addressing
       •  Peer-to-peer memory access/copy
       •  Streams – sequences of commands
       •  Single-thread access to all GPUs

    Slide 6

    *J.S. Vetter, R. Glassbrook, J. Dongarra, K. Schwan, B. Loftis, S. McNally, J. Meredith, J. Rogers, P. Roth, K. Spafford, and S. Yalamanchili, “Keeneland: Bringing heterogeneous GPU computing to the computational science community,” IEEE Computing in Science and Engineering, 13(5):90-5, 2011, http://dx.doi.org/10.1109/MCSE.2011.83.


    SP2 Algorithm Using the Hybrid CPU/GPU Approach

    Slide 7

    Estimate εmax and εmin
    X = (εmax·I − H) / (εmax − εmin)
    TraceX = Tr[X]                           /* trace kernel on GPU */
    until converged do
        Xtmp = X
        Xtmp = −X² + Xtmp                    /* Xtmp = X − X², CUBLAS xGEMM */
        TraceXtmp = Tr[Xtmp]                 /* trace kernel on GPU */
        if |2·TraceX − 2·TraceXtmp − Ne| > |2·TraceX + 2·TraceXtmp − Ne|
            X = X + Xtmp                     /* take 2X − X² */
            TraceX = TraceX + TraceXtmp
        else
            X = X − Xtmp                     /* take X² */
            TraceX = TraceX − TraceXtmp
    end until
    ρ = X
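The Xtmp update is why xGEMM dominates the cost: the generalized multiply contract C ← α·A·B + β·C, with α = −1 and β = 1 and Xtmp seeded with a copy of X, yields X − X² in one call, so the branch updates X ± Xtmp select 2X − X² or X². A pure-Python stand-in for that contract (the CUBLAS call takes the same α/β arguments):

```python
# Generalized matrix-matrix multiply, C <- alpha*A*B + beta*C:
# the contract of BLAS/CUBLAS xGEMM.

def gemm(alpha, A, B, beta, C):
    n = len(A)
    return [[alpha * sum(A[i][k] * B[k][j] for k in range(n)) + beta * C[i][j]
             for j in range(n)] for i in range(n)]

X = [[0.7, 0.1],
     [0.1, 0.3]]
Xtmp = [row[:] for row in X]           # Xtmp = X
Xtmp = gemm(-1.0, X, X, 1.0, Xtmp)     # Xtmp = -X*X + Xtmp = X - X^2
```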


    SP2 Algorithm Using the Full GPU Approach

    Slide 8

    Estimate εmax and εmin
    X = (εmax·I − H) / (εmax − εmin)
    TraceX = Tr[X]                           /* trace kernel on GPU */
    until converged do
        Xtmp = X
        Xtmp = −X² + Xtmp                    /* Xtmp = X − X², CUBLAS xGEMM */
        TraceXtmp = Tr[Xtmp]                 /* trace kernel on GPU */
        if |2·TraceX − 2·TraceXtmp − Ne| > |2·TraceX + 2·TraceXtmp − Ne|
            X = X + Xtmp                     /* CUBLAS xAXPY */
            TraceX = TraceX + TraceXtmp
        else
            X = X − Xtmp                     /* CUBLAS xAXPY */
            TraceX = TraceX − TraceXtmp
    end until
    ρ = X
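In this full-GPU variant the branch updates X ± Xtmp also stay on the device as level-1 BLAS calls: y ← α·x + y over the matrix viewed as one long vector, which is the xAXPY contract. A pure-Python stand-in (toy values, not LATTE data):

```python
# y <- alpha*x + y on flat arrays: the BLAS/CUBLAS xAXPY contract.
# A dense n x n matrix is just an n*n-element vector to xAXPY.

def axpy(alpha, x, y):
    return [yi + alpha * xi for xi, yi in zip(x, y)]

X    = [0.7, 0.1, 0.1, 0.3]        # 2x2 matrix in row-major order
Xtmp = [0.2, 0.0, 0.0, 0.2]        # X - X^2 from the xGEMM step
X_plus  = axpy( 1.0, Xtmp, X)      # X + Xtmp, i.e. 2X - X^2
X_minus = axpy(-1.0, Xtmp, X)      # X - Xtmp, i.e. X^2
```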


    CUBLAS Matrix Multiplication Performance (Nvidia M2070)

    Slide 9

    Song, F., Tomov, S., Dongarra, J. "Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems," 26th ACM International Conference on Supercomputing (ICS 2012), ACM, San Servolo Island, Venice, Italy, June, 2012.


    Array Padding for Performance (M x M)

    Slide 10

    [Figure: average time per CUBLAS xGEMM execution (s) vs. matrix dimension M, 3500–4000; (a) double precision, 0.30–0.45 s, (b) single precision, 0.14–0.22 s]

    Average time to execute the CUBLAS xGEMM generalized matrix-matrix multiplication for M × M matrices: (a) DGEMM, (b) SGEMM. The broken and solid lines correspond to computations performed with and without padding of the arrays, respectively.
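The padding rule can be sketched as rounding the matrix dimension up to a multiple of a tile size and zero-filling the new rows and columns. The tile size of 32 below is an assumption for illustration, not necessarily the value LATTE uses:

```python
# Round M up to the next multiple of the tile size, then embed the
# M x M matrix in a zero-padded array so xGEMM runs on full tiles.

def pad_dim(m, tile=32):
    return ((m + tile - 1) // tile) * tile

def zero_pad(A, tile=32):
    m, p = len(A), pad_dim(len(A), tile)
    return [[A[i][j] if i < m and j < m else 0.0 for j in range(p)]
            for i in range(p)]
```

Because the padding is all zeros, the meaningful M × M block of any product is unchanged and can be read back from the top-left corner of the padded result.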


    Performance Analysis: Density Matrix Calculation (Nvidia M2070) – Liquid Methane (10-1000 molecules)

    Slide 11

    [Figure: time per ρ calculation (s) vs. number of orbitals, double precision, shown in log-log (100–1000 orbitals, 0.01–100 s) and linear (0–8000 orbitals, 50–300 s) panels, comparing diagonalization, SP2: CPU (algorithm 1), SP2: CPU/GPU (algorithm 2), and SP2: GPU (algorithm 3)]


    Error Analysis in Density Matrices – Idempotency Measure

    Slide 12

    [Figure: idempotency error ‖ρ² − ρ‖₂, 10⁻¹⁶–10⁻¹³, vs. system size for diagonalization, SP2: CPU (algorithm 1), SP2: CPU/GPU (algorithm 2), and SP2: GPU (algorithm 3)]

    The SP2 algorithm has errors that are independent of system size, whereas traditional diagonalization yields errors that increase with the number of atoms.

    An exact density matrix is idempotent, ρ² = ρ, so the error is measured as Error = ‖ρ² − ρ‖₂.


    Multi-GPU Generalized Matrix-Matrix Multiplication

    •  Multiple streams are used for sub-block matrix-matrix multiplications, additions, and matrix traces

    •  The blocked matrix is reassembled efficiently via the native functionality of CUBLAS DGEMM/SGEMM

    Slide 13

    [Figure: schematic of a blocked matrix-matrix multiplication, with sub-blocks distributed across GPU 0 … GPU N−1 and the product reassembled from the per-GPU results]
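The decomposition in the schematic can be sketched as a row-block split of C = A·B: each slice of rows of A multiplies the full B independently, so each slice can be dispatched to its own GPU and stream, and C is reassembled by concatenating the per-device blocks. A pure-Python stand-in with no real devices:

```python
def matmul(A, B):
    rows, cols, inner = len(A), len(B[0]), len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def blocked_matmul(A, B, n_gpus):
    """C = A*B assembled from independent row-block products."""
    n = len(A)
    bounds = [n * g // n_gpus for g in range(n_gpus + 1)]
    C = []
    for g in range(n_gpus):                    # one slice per GPU/stream
        A_block = A[bounds[g]:bounds[g + 1]]   # rows owned by device g
        C.extend(matmul(A_block, B))           # its rows of the product
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
assert blocked_matmul(A, B, n_gpus=2) == matmul(A, B)
```

The row-block form needs no communication between devices during the multiply; only the broadcast of B and the final gather of C touch more than one GPU, which is what the peer-to-peer copies and streams on the previous slide provide.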


    Using Streams for SP2 Sub-block Matrix Operations

    Slide 14

    Matrix blocks are assembled in GPU 0 and redistributed. Each GPU then executes the per-iteration sub-block operations in its own stream (GPU 0 – Stream 0, GPU 1 – Stream 1, …):

    •  Xtmp = X − X²   (xGEMM)

    •  Tr[Xtmp]   (trace kernel)

    •  X ± Xtmp   (xAXPY)


    Performance Analysis: Density Matrix Calculation (Nvidia M2090) – Liquid Methane (10 – 1250 molecules)

    Slide 15

    [Figure: time per density matrix build (s) vs. matrix dimension, 0–10000, comparing SP2 on 1, 2, and 3 GPUs, SP2 on the CPU, and diagonalization]


    Performance Analysis for 1-3 GPUs

    Slide 16

    [Figure: time per density matrix build (s) vs. matrix dimension, log-log, 1000–10000, for 1, 2, and 3 GPUs, each with measured and optimal curves]


    Summary

    •  GPUs can be used effectively for the density matrix computation in quantum mechanical models

    •  The recursive SP2 algorithm is well suited to the GPU architecture

    •  Transfers of arrays between the CPU and GPU are a minor performance cost

    •  Array padding is important for performance

    •  The GPU version of the SP2 algorithm provides accuracy comparable to or better than traditional diagonalization

    •  Massive speed-ups with respect to traditional algorithms have been observed with no loss of accuracy

    Slide 17


    Related Publications

    1.  Sanville EJ, Bock N, Coe J, Mniszewski SM, Niklasson AMN, Cawkwell MJ, 2010, LATTE. Los Alamos National Laboratory (LA-CC-10-004), http://savannah.nongnu.org/projects/latte.

    2.  Cawkwell MJ, Sanville EJ, Mniszewski SM, Niklasson AMN, 2012, Computing the Density Matrix in Electronic Structure Theory on Graphics Processing Units. J. Chem. Theory Comput., Vol 8, Issue 11, pp. 4094-4101, http://pubs.acs.org/doi/full/10.1021/ct300442w.

    3.  Mniszewski SM, Cawkwell MJ, Niklasson AMN, 2013, Quantum-based Dynamics on Graphics Processing Units. 2013 Associate Directorate for Theory, Simulation, and Computation (ADTSC) Highlights (in press).

    Slide 18