Slide 1 UNCLASSIFIED

Distributed Graph-based Density Matrix Calculation for Quantum Molecular Dynamics using GPUs

April 4-7, 2016

2016 GPU Technology Conference

S. M. Mniszewski, C. F. A. Negre, M. J. Cawkwell, A. M. N. Niklasson

Los Alamos National Laboratory

[email protected]



Slide 2

Why Molecular Dynamics?

Slide 3

Background

•  In molecular dynamics simulation, the relative positions of atoms evolve over a series of time steps according to the force acting on each atom

•  Employed in materials science, chemistry, and biology to study structures, defects, and equilibrium and non-equilibrium phenomena

•  Depends on an interatomic potential to calculate forces and energies

•  Quantum-based models capture the making and breaking of covalent bonds, charge transfer between species of differing electronegativities, and long-range electrostatic interactions, so they can describe chemical reactions
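
The time stepping described in the first bullet can be sketched with a velocity Verlet integrator. This is only an illustration under stated assumptions, not the integrator used in the talk: the harmonic force, the names `harmonic_force`, `velocity_verlet` and `total_energy`, and all parameter values are hypothetical. In QMD the force would come from the electronic structure calculation instead.

```python
def harmonic_force(x, k=1.0):
    """Hypothetical stand-in for an interatomic force: F = -k x."""
    return -k * x


def velocity_verlet(x, v, force, mass=1.0, dt=0.01, n_steps=1000):
    """Advance one particle in 1-D with the velocity Verlet scheme."""
    f = force(x)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half kick with the old force
        x = x + dt * v_half                # drift to the new position
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # half kick with the new force
        trajectory.append((x, v))
    return trajectory


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus harmonic potential energy."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


traj = velocity_verlet(x=1.0, v=0.0, force=harmonic_force)
e_start = total_energy(*traj[0])
e_end = total_energy(*traj[-1])
```

For this toy potential the total energy stays close to its initial value over the run, which is the kind of check shown later for the NVE simulation.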

Slide 4

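Quantum Molecular Dynamics (QMD)

Integrate the equations of motion of classical molecular trajectories with the forces calculated on the fly from a self-consistent quantum mechanical description of the electronic structure:

$$M_I \ddot{R}_I = -\frac{\partial U(R; \rho_{sc})}{\partial R_I}$$

$$H[\rho]\,\psi_i = \epsilon_i \psi_i, \qquad \rho = \sum_{\mathrm{occ.}} |\psi_i|^2 \rightarrow \rho_{sc}$$

Born-Oppenheimer MD: a self-consistent field (SCF) optimization at each time step takes $U(R; \rho)$ to $U(R; \rho_{sc})$.

Cost: #SCF × O(N^3)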

Slide 5

Example: Biosynthesis of Histidine

•  What is responsible for the biosynthesis of histidine?

•  Present in several pathogenic bacteria

•  Allosteric mechanism was determined with MD simulation

•  Several thousand atoms over timescales of 100s of ns!

Rivalta I, Sultan MM, Lee N-S, Manley GA, Loria JP, Batista VS. PNAS, 2012 109(22), E1428-E1436.

Slide 6

Future QMD Simulations

IAPP dimerization and type-2 diabetes (537 atoms).

Beta Amyloid Peptide and Alzheimer’s Disease (410 atoms).

Raffa DF, Rauk A, J. Phys. Chem. B, 2007, 111 (14), 3789-99.

Dupuis NF, Wu C, Shea J-E, Bowers MT, J. Am. Chem. Soc., 2011, 133 (19), 7240-43.

Slide 7

Computational Cost

•  High computational cost and complexity of QMD calculations

•  The density matrix construction is the most expensive part of each MD time step

•  The second order spectral projection (SP2) algorithm breakthrough

•  Use of hybrid parallelism on GPU-accelerated clusters

[Figure: Quantum Molecular Dynamics. Wall-clock time per MD time step vs. system size (N); curves: O(N^3) diagonalization, O(N) regular linear scaling, O(N) low pre-factor.]

Significant pre-factor reduction is required for practical large-scale QMD!

Slide 8

The Density Matrix Computation

•  Typically, algorithms used in quantum-based models, most notably matrix diagonalization, are not ideally suited to GPUs
    –  Due to their complexity
    –  Difficulty in extracting thread-level parallelism
    –  Difficulty of avoiding branching within warps

•  New SP2 approach
    –  The density matrix is computed directly from the Hamiltonian through a recursive expansion of the Fermi operator with the second order spectral projection (SP2) algorithm
    –  Based on a series of generalized matrix-matrix multiplications
    –  Only one matrix-matrix multiplication is required per iteration
    –  Maps very well to GPUs

Slide 9

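The Second Order Spectral Projection Algorithm (SP2) – Reduced Complexity

[Figure: f(x) vs. x on [0, 1] for the projections f(x) = x^2 and f(x) = 2x - x^2, and for the composed expansion f8(f7(...f1(x)...)).]

Recursive Fermi Operator expansion:

$$\rho = \theta[\mu I - H] = \lim_{i \to \infty} f_i[f_{i-1}[\ldots f_0[X_0] \ldots]]$$

$$X_0 = \frac{\epsilon_{max} I - H}{\epsilon_{max} - \epsilon_{min}}$$

$$f_i[X_i] = \begin{cases} X_i^2 & \text{if } 2\,\mathrm{Tr}[X_i] \ge N_e \\ 2X_i - X_i^2 & \text{if } 2\,\mathrm{Tr}[X_i] < N_e \end{cases}$$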

Niklasson AMN, Phys. Rev. B 66, 155115 (2002).

Slide 10

SP2 Algorithm Using the GPU Approach

Estimate εmax and εmin
X = (εmaxI-H)/(εmax-εmin)
TraceX = Tr[X]                          /* Trace kernel on GPU */
Until converged do
    Xtmp = X
    Xtmp = -X^2 + Xtmp                  /* CUBLAS xGEMM */
    TraceXtmp = Tr[Xtmp]                /* Trace kernel on GPU */
    if |2TraceX - 2TraceXtmp - Ne| > |2TraceX + 2TraceXtmp - Ne|
        X = X + Xtmp                    /* CUBLAS xAXPY */
        TraceX = TraceX + TraceXtmp
    else
        X = X - Xtmp                    /* CUBLAS xAXPY */
        TraceX = TraceX - TraceXtmp
end until
ρ = X

Cawkwell MJ, Mniszewski SM, Niklasson AMN, Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE, GTC 2013.
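
The loop above can be sketched on the CPU in plain Python. This is a minimal illustration rather than the LATTE or GPU implementation: the helper names, the Gershgorin estimate for εmax and εmin, the stopping test on Tr[X - X^2], and the 4-site chain Hamiltonian are all assumptions made so the example is self-contained. List-based matrix products stand in for the xGEMM and xAXPY calls.

```python
def matmul(a, b):
    """Dense product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def gershgorin_bounds(h):
    """Assumed estimator: spectral bounds from Gershgorin circles."""
    n = len(h)
    radii = [sum(abs(h[i][j]) for j in range(n) if j != i) for i in range(n)]
    return (min(h[i][i] - radii[i] for i in range(n)),
            max(h[i][i] + radii[i] for i in range(n)))


def sp2(h, n_electrons, tol=1e-10, max_iter=100):
    """Density matrix from H by SP2; returns the last iterate."""
    n = len(h)
    e_min, e_max = gershgorin_bounds(h)
    scale = e_max - e_min
    # X0 = (eps_max I - H) / (eps_max - eps_min), eigenvalues in [0, 1]
    x = [[((e_max if i == j else 0.0) - h[i][j]) / scale for j in range(n)]
         for i in range(n)]
    tr_x = trace(x)
    for _ in range(max_iter):
        x2 = matmul(x, x)
        x_tmp = [[x[i][j] - x2[i][j] for j in range(n)] for i in range(n)]
        tr_tmp = trace(x_tmp)            # Tr[X - X^2], zero when idempotent
        if abs(tr_tmp) < tol:
            break
        # Choose the projection that brings 2 Tr[X] closer to Ne.
        if (abs(2 * tr_x - 2 * tr_tmp - n_electrons)
                > abs(2 * tr_x + 2 * tr_tmp - n_electrons)):
            sign = 1.0                   # X <- 2X - X^2
        else:
            sign = -1.0                  # X <- X^2
        x = [[x[i][j] + sign * x_tmp[i][j] for j in range(n)]
             for i in range(n)]
        tr_x += sign * tr_tmp
    return x


# Hypothetical test system: 4-site chain, hopping -1, two electrons.
h = [[0.0, -1.0, 0.0, 0.0],
     [-1.0, 0.0, -1.0, 0.0],
     [0.0, -1.0, 0.0, -1.0],
     [0.0, 0.0, -1.0, 0.0]]
rho = sp2(h, n_electrons=2)
```

With two electrons the result should be the projector onto the lowest eigenstate of the chain, so `rho` is idempotent with trace 1.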

Slide 11

Density Matrix Calculation (Nvidia M2090) – Liquid Methane (10 – 1250 molecules)

[Figure: time per density matrix build (s) vs. matrix dimension (0 to 10000); curves: SP2 on 1 GPU, SP2 on 2 GPUs, SP2 on 3 GPUs, SP2 on CPU, diagonalization.]

Cawkwell MJ, Mniszewski SM, Niklasson AMN, Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE, GTC 2013.

Slide 12

Sparse Matrix SP2 – ELLPACK-R Format

•  Described by three arrays: 2-D values, 2-D column indices, and a 1-D count of non-zero entries per row

•  N rows and at most M non-zeroes per row; O(NM^2) computational complexity

•  Row-wise storage for parallelism opportunities

•  No insertion cost compared to CSR

[Figure: an example dense matrix, the corresponding sparse matrix, and its ELLPACK-R arrays: Values, Columns, and # Non-zeroes.]

Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yusof J, Bock N, Germann TC, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654.
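
The three-array layout and the O(NM^2) multiply can be sketched as follows. This is an illustration of the format as described in the bullets, not code from the cited paper: the function names, the fixed `max_nnz` width, and the per-row dictionary accumulator are assumptions.

```python
def dense_to_ellpack_r(dense, max_nnz):
    """Build (values, columns, nnz) with max_nnz slots per row."""
    n = len(dense)
    values = [[0.0] * max_nnz for _ in range(n)]
    columns = [[0] * max_nnz for _ in range(n)]
    nnz = [0] * n
    for i, row in enumerate(dense):
        for j, a in enumerate(row):
            if a != 0.0:
                if nnz[i] == max_nnz:
                    raise ValueError("row %d exceeds max_nnz" % i)
                values[i][nnz[i]] = a
                columns[i][nnz[i]] = j
                nnz[i] += 1
    return values, columns, nnz


def ellpack_r_to_dense(ell, n):
    values, columns, nnz = ell
    dense = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for p in range(nnz[i]):
            dense[i][columns[i][p]] = values[i][p]
    return dense


def ellpack_r_multiply(a, b, n, max_nnz):
    """C = A B, one independent row at a time: O(N M^2) work."""
    a_val, a_col, a_nnz = a
    b_val, b_col, b_nnz = b
    c_val = [[0.0] * max_nnz for _ in range(n)]
    c_col = [[0] * max_nnz for _ in range(n)]
    c_nnz = [0] * n
    for i in range(n):                       # rows can be processed in parallel
        acc = {}
        for p in range(a_nnz[i]):
            k = a_col[i][p]
            for q in range(b_nnz[k]):
                j = b_col[k][q]
                acc[j] = acc.get(j, 0.0) + a_val[i][p] * b_val[k][q]
        if len(acc) > max_nnz:
            raise ValueError("row %d of the product exceeds max_nnz" % i)
        for slot, j in enumerate(sorted(acc)):
            c_val[i][slot] = acc[j]
            c_col[i][slot] = j
        c_nnz[i] = len(acc)
    return c_val, c_col, c_nnz


a_dense = [[1.0, 0.0, 2.0],
           [0.0, 3.0, 0.0],
           [4.0, 0.0, 5.0]]
a_ell = dense_to_ellpack_r(a_dense, max_nnz=2)
a_squared = ellpack_r_to_dense(ellpack_r_multiply(a_ell, a_ell, 3, 2), 3)
```

Each row writes only to its own slots, so appending a new non-zero needs no shifting of other rows, which is the "no insertion cost" point made above.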

Slide 13

SP2 Shared Memory – Significant Cost Reduction

[Figure: density matrix construction time (s) vs. number of atoms (0 to 6000) for polyethylene; curves: Diag. serial, Diag. 16 threads, SP2 CSR serial, SP2 ELL 1 thread, SP2 ELL 4 threads, SP2 ELL 16 threads.]

Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yusof J, Bock N, Germann TC, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654.

Slide 14

Shared Memory SP2 – TRP Cage Protein

[Figure: two panels of total energy (eV) vs. time (ps), one spanning about -39050 to -38850 eV and one zoomed to -39028.1 to -39027.8 eV over 6 to 18 ps; regions labeled NVT and NVE.]

Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yusof J, Bock N, Germann TC, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654.

303 atom Trp Cage Protein solvated by 2682 water molecules (8349 atoms)

LATTE simulation, 18.8 ps: thermalization (NVT) and microcanonical (NVE) simulation

Slide 15

Sparse Matrix Algebra Divide and Conquer

Graph Theory

Graph-based Electronic Structure Theory

Slide 16

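Graph-based SP2

Data dependency graph $S_\tau$, with one subgraph $s_\tau^{(i)}$ (core plus halo) for each $i$:

$$S_\tau = \left\{ s_\tau^{(i)} \right\}_{i=1}^{N}$$

$$H = \left\{ h[s_\tau^{(i)}] \right\}_{i=1}^{N}$$

Recursive Fermi-operator expansion:

$$D_\tau = \left\{ \lim_{n \to \infty} f_n(f_{n-1}(\ldots f_0(h[s_\tau^{(i)}]) \ldots)) \right\}_{i=1}^{N}$$

$S_\tau$ and the globally thresholded expansion $\lfloor \text{Fermi Operator Expansion} \rfloor_\tau$ (global): Exact Relation!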

Slide 17

Graph-based SP2 – Hybrid approach

On GPUs!

Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.

Slide 18

Graph Partitioning SP2

[Figure: graph of H, partitioned graph of H, a subgraph of H with base and halo nodes, and the corresponding submatrix of H/D; Ht and Dt-1 are the inputs and Dt is the output.]

Graph Partitioning

•  Structure-based

•  Graph-based

•  Hypergraph-based

•  Community-based

Subgraph Processing

1. Determine core+halo

2. Extract submatrix

3. Run Dense SP2

4. Collect into next D

Trivial parallelism based on dense matrix algebra at BLAS3 performance

Partitions are sets of core rows/orbitals of H

Slide 19

Subgraph Processing

For all sub-graphs, independently and in parallel: determine elements, extract submatrix, run SP2, assemble into Dt

•  Trivial parallelism

•  Same HOMO-LUMO sequence through SP2

•  Dense matrix & communication-free SP2

•  Small subgraphs, process single-threaded

•  Large subgraphs, process multi-threaded

•  Tunable accuracy - thresholds
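
The per-subgraph steps (determine core+halo, extract submatrix, run a dense solver, assemble the core rows) can be sketched as below. This is a simplified illustration and not the PROGRESS implementation: the function names are hypothetical, the graph is taken from the non-zero pattern of a matrix, and the dense solver is passed in as a callback. In practice that callback would be a dense SP2 that applies the same sequence of projections in every subgraph, as the bullets above note.

```python
def graph_from_matrix(m, threshold=0.0):
    """Adjacency sets from entries with magnitude above the threshold."""
    n = len(m)
    return [{j for j in range(n) if j != i and abs(m[i][j]) > threshold}
            for i in range(n)]


def core_plus_halo(graph, core):
    """Core indices first, then the halo: neighbours outside the core."""
    core_set = set(core)
    halo = sorted({j for i in core for j in graph[i]} - core_set)
    return list(core) + halo


def extract_submatrix(h, indices):
    return [[h[i][j] for j in indices] for i in indices]


def graph_based_solve(h, graph, partitions, dense_solver):
    """Apply dense_solver to each core+halo submatrix, keep core rows."""
    n = len(h)
    d = [[0.0] * n for _ in range(n)]
    for core in partitions:              # independent, hence parallelizable
        indices = core_plus_halo(graph, core)
        d_sub = dense_solver(extract_submatrix(h, indices))
        for a in range(len(core)):       # only core rows are collected
            for b, j in enumerate(indices):
                d[indices[a]][j] = d_sub[a][b]
    return d


# Hypothetical 4-site chain split into two partitions of two sites each.
h = [[0.0, -1.0, 0.0, 0.0],
     [-1.0, 0.0, -1.0, 0.0],
     [0.0, -1.0, 0.0, -1.0],
     [0.0, 0.0, -1.0, 0.0]]
graph = graph_from_matrix(h)
partitions = [[0, 1], [2, 3]]
subgraphs = [core_plus_halo(graph, core) for core in partitions]

# Identity "solver": checks that extraction and assembly are consistent.
roundtrip = graph_based_solve(h, graph, partitions,
                              lambda sub: [row[:] for row in sub])
```

With the identity callback the assembled matrix reproduces `h`, which confirms that every core row is written back to the right global columns.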

Slide 20

Graph-based XL Born-Oppenheimer Molecular Dynamics (XL-BOMD)

Liquid water, (H2O)_100

DFTB-LATTE

Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.

Slide 21

Graph-based SP2 – Tunable accuracy

Polyalanine in water, 20,000 atoms

[Figure: ||D - D_exact||_F per atom (1×10^-5 to 1×10^-1) vs. numerical threshold (1×10^-8 to 1×10^-4); curves: 512, 1024, and 2048 subgraphs.]

Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.

Slide 22

Graph-based SP2 – Distributed GPU Performance

Density matrix calculations for Polyalanine in water, 20,000 atoms

[Figure: SP2 run time (s, ticks from 0.5 to 128) vs. number of communities (64 to 4096); curves: 1 CPU SpM Alg (MKL), 1 CPU Graph Part., 1 GPU Graph Part., 16 GPU Graph Part., 32 GPU Graph Part.]

•  64-4096 METIS partitions

•  Single Nvidia Tesla M2090 GPU per node

•  Partition core+halo sizes vary, load-balancing required

•  16,384 GPU threads, still perfect strong scaling

•  Best - ~25 µs/atom
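
Because core+halo sizes vary, the partitions have to be spread over the GPUs so that each gets a similar amount of work. The slides do not say how this is done; the sketch below is one hypothetical option, a greedy longest-first assignment, with the cost of a partition assumed to scale as the cube of its submatrix dimension (the cost of dense matrix multiplication). The name `balance_partitions` and the example sizes are illustrative.

```python
import heapq


def balance_partitions(sizes, n_devices, cost=lambda n: n ** 3):
    """Greedy assignment: largest remaining job to the least-loaded device."""
    heap = [(0, dev) for dev in range(n_devices)]   # (load, device)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_devices)]
    order = sorted(range(len(sizes)), key=lambda p: cost(sizes[p]),
                   reverse=True)
    for p in order:
        load, dev = heapq.heappop(heap)
        assignment[dev].append(p)
        heapq.heappush(heap, (load + cost(sizes[p]), dev))
    loads = [sum(cost(sizes[p]) for p in parts) for parts in assignment]
    return assignment, loads


# Hypothetical core+halo sizes for six partitions, shared by two devices.
sizes = [4, 3, 3, 2, 2, 2]
assignment, loads = balance_partitions(sizes, n_devices=2)
```

For these sizes the two devices end up with estimated loads of 72 and 70 cost units, compared with 64 for the single largest partition.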

Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.

Slide 23

Graph-based SP2 – Distributed CPU vs. GPU

Density matrix calculations for Polyalanine in water, 20,000 atoms

•  1024 and 2048 METIS partitions

•  16-256 CPU/GPU nodes

•  MKL vs. cuBLAS

•  Similar for 1024 and 2048 partitions

•  Near linear scaling

•  Speedup of 1.7X on GPUs

•  Best: 5 µs/atom

Slide 24

Basic Matrix Library (BML) & PROGRESS Library

bml C API

●  bml_multiply()

●  bml_add()

●  ...

bml Fortran API

●  call bml_multiply()

●  call bml_add()

●  ...

BML library core

●  Matrix formats: dense matrix, sparse ELLPACK matrix, sparse CSR matrix

●  Targets: CPU/SMP, GPGPU, MPI

Uses BLAS Level 2 and 3 routines.

BML available at http://qmmd.github.io/bml/ under the BSD 3-clause license

Slide 25

Summary

•  Distributed graph-based SP2 provides significant speedup using distributed GPU-accelerated architectures

•  Available as part of BML & PROGRESS libraries

•  Allows for QMD simulations of larger systems and longer timeframes than previously possible