Slide 1 U N C L A S S I F I E D
Distributed Graph-based Density Matrix Calculation for Quantum
Molecular Dynamics using GPUs
April 4-7, 2016
2016 GPU Technology Conference
S. M. Mniszewski, C. F. A. Negre, M. J. Cawkwell, A. M. N. Niklasson Los Alamos National Laboratory
Slide 2
Why Molecular Dynamics?
Slide 3
Background
• In molecular dynamics simulation, the relative positions of atoms evolve over a series of time steps according to the force acting on each atom
• Employed in materials science, chemistry, and biology to study structures, defects, and equilibrium and non-equilibrium phenomena
• Dependence on an interatomic potential to calculate forces and energy
• Quantum-based models capture the making and breaking of covalent bonds, charge transfer between species of differing electronegativities, and long-range electrostatic interactions, i.e., chemical reactions
Slide 4
Integrate the equations of motion of classical molecular trajectories
with the forces calculated on the fly from a self-consistent quantum mechanical description of the electronic structure:
M_I R̈_I = −∂U(R; ρ_sc) / ∂R_I
H[ρ] φ_i = ε_i φ_i
ρ = Σ_occ |φ_i|² → ρ_sc
U(R; ρ) → U(R; ρ_sc)   (iterated to self-consistency, SCF)
Born-Oppenheimer MD
Cost per MD step: #SCF × O(N³)
Quantum Molecular Dynamics (QMD)
Slide 5
Example: Biosynthesis of Histidine
• What is responsible for the biosynthesis of histidine?
• Present in several pathogenic bacteria
• Allosteric mechanism was determined with MD simulation
• Several thousand atoms over timescales of 100s of ns!
Rivalta I, Sultan MM, Lee N-S, Manley GA, Loria JP, Batista VS. PNAS, 2012 109(22), E1428-E1436.
Slide 6
Future QMD Simulations
IAPP dimerization and type-2 diabetes (537 atoms).
Beta Amyloid Peptide and Alzheimer’s Disease (410 atoms).
Raffa DF, Rauk A, J. Phys. Chem. B, 2007, 111 (14), 3789-99. Dupuis NF, Wu C, Shea J-E, Bowers MT, J. Am. Chem. Soc., 2011, 133 (19), 7240-43.
Slide 7
Computational Cost
• High computational cost and complexity of QMD calculations
• The density matrix construction is the most expensive part of each MD timestep
• The second order spectral projection (SP2) algorithm breakthrough
• Use of hybrid parallelism on GPU-accelerated clusters
[Figure: wall-clock time per MD time step vs. system size N for O(N³) diagonalization, O(N) regular linear scaling, and O(N) low pre-factor methods. Significant pre-factor reduction is required for practical large-scale QMD!]
Quantum Molecular Dynamics
Slide 8
The Density Matrix Computation
• Typically, algorithms used in quantum-based models, most notably matrix diagonalization, are not ideally suited to GPUs
  – Due to their complexity
  – Difficulty in extracting thread-level parallelism
  – Difficulty of avoiding branching within warps
• New SP2 approach
  – Computed directly from the Hamiltonian through a recursive expansion of the Fermi operator with the second order spectral projection (SP2) algorithm
  – Based on a series of generalized matrix-matrix multiplications
  – Only one matrix-matrix multiplication is required per iteration
  – Maps very well to GPUs
Slide 9
[Figure: the SP2 projection polynomials f(x) = x² and f(x) = 2x − x² on [0, 1]; the composition f8(f7(...f1(x)...)) approaches a step function.]
The Second Order Spectral Projection Algorithm (SP2) – Reduced Complexity
ρ = θ[μI − H] = lim_{i→∞} f_i[f_{i−1}[... f_0[X_0] ...]]

X_0 = (ε_max I − H) / (ε_max − ε_min)

f_i[X_i] = X_i²           if 2 Tr[X_i] ≥ N_e
f_i[X_i] = 2X_i − X_i²    if 2 Tr[X_i] < N_e

Recursive Fermi operator expansion
Niklasson AMN, Phys. Rev. B 66, 155115 (2002).
Slide 10
SP2 Algorithm Using the GPU Approach
Estimate εmax and εmin
X = (εmax·I − H) / (εmax − εmin)
TraceX = Tr[X]                          /* trace kernel on GPU */
Until converged do
    Xtmp = X
    Xtmp = Xtmp − X·X                   /* cuBLAS xGEMM (α = −1, β = 1), i.e. Xtmp = X − X² */
    TraceXtmp = Tr[Xtmp]                /* trace kernel on GPU */
    if |2·TraceX − 2·TraceXtmp − Ne| > |2·TraceX + 2·TraceXtmp − Ne|
        X = X + Xtmp                    /* cuBLAS xAXPY: X = 2X − X² */
        TraceX = TraceX + TraceXtmp
    else
        X = X − Xtmp                    /* cuBLAS xAXPY: X = X² */
        TraceX = TraceX − TraceXtmp
end until
ρ = X
Cawkwell MJ, Mniszewski SM, Niklasson AMN, Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE, GTC 2013.
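The loop above can be transcribed into a few lines of NumPy for illustration (a minimal sketch: the cuBLAS calls and the GPU trace kernel become NumPy operations, the simpler branch test from the previous slide is used, and the 3×3 Hamiltonian and electron count Ne are made-up values):

```python
import numpy as np

# Toy symmetric "Hamiltonian" and electron count -- illustration only.
H = np.array([[0.0, 0.4, 0.0],
              [0.4, 0.2, 0.3],
              [0.0, 0.3, 0.8]])
Ne = 2.0                                   # two electrons -> Tr[rho] = 1

e = np.linalg.eigvalsh(H)                  # here we diagonalize for the
e_min, e_max = e[0], e[-1]                 # bounds; in practice they are
                                           # estimated, e.g. Gershgorin circles
X = (e_max * np.eye(3) - H) / (e_max - e_min)   # spectrum mapped into [0, 1]

for _ in range(25):                        # fixed count instead of a
    X2 = X @ X                             # convergence test, for brevity
    if 2.0 * np.trace(X) >= Ne:
        X = X2                             # f(x) = x^2: push toward 0
    else:
        X = 2.0 * X - X2                   # f(x) = 2x - x^2: push toward 1

rho = X                                    # idempotent density matrix
```

For this toy Hamiltonian the recursion converges to an idempotent ρ with Tr[ρ] = Ne/2, with only one matrix-matrix multiplication per iteration.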
Slide 11
Density Matrix Calculation (Nvidia M2090) – Liquid Methane (10 – 1250 molecules)
[Figure: time per density matrix build (s) vs. matrix dimension (up to 10,000) for SP2 on 1, 2, and 3 GPUs, SP2 on the CPU, and diagonalization.]
Slide 12
Sparse Matrix SP2 – ELLPACK-R Format
• Described by 3 arrays, 2-D values and indices, 1-D non-zero entries per row
• N rows and M max non-zeroes per row; O(NM²) computational complexity
• Row-wise storage for parallelism opportunities
• No insertion cost compared to CSR
[Figure: a dense 0/1 example matrix and its sparse ELLPACK-R representation: 2-D Values and Columns arrays padded to the maximum row fill, plus a 1-D #Non-zeroes-per-row array.]
Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654.
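As a concrete sketch of the three-array layout (illustrative Python; the function names are made up, not the implementation from the paper):

```python
import numpy as np

# ELLPACK-R storage: 2-D `values` and `columns` arrays padded to the
# maximum row fill M, plus a 1-D count of non-zeros per row.
def dense_to_ellpack_r(A, threshold=0.0):
    n = A.shape[0]
    rows = [np.nonzero(np.abs(A[i]) > threshold)[0] for i in range(n)]
    m = max(len(r) for r in rows)          # M: max non-zeros in any row
    values = np.zeros((n, m))
    columns = np.zeros((n, m), dtype=int)
    nnz = np.zeros(n, dtype=int)
    for i, c in enumerate(rows):
        nnz[i] = len(c)
        columns[i, :len(c)] = c
        values[i, :len(c)] = A[i, c]
    return values, columns, nnz

# Row-wise access pattern: the thread handling row i walks exactly
# nnz[i] entries, which is what makes the format GPU-friendly.
def ellpack_r_matvec(values, columns, nnz, x):
    y = np.zeros(len(nnz))
    for i in range(len(nnz)):
        for k in range(nnz[i]):
            y[i] += values[i, k] * x[columns[i, k]]
    return y
```

Because each row stores its own non-zero count, appending an element to a row costs no data movement in other rows, unlike the shifting required by CSR insertion.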
Slide 13
SP2 Shared Memory – Significant Cost Reduction
[Figure: density matrix construction time (s) vs. number of atoms (up to 6000) for polyethylene: serial and 16-thread diagonalization vs. SP2 with CSR (serial) and ELLPACK at 1, 4, and 16 threads.]
Slide 14
Shared Memory SP2 – TRP Cage Protein
[Figure: total energy (eV) vs. time (ps) for the NVT (thermalization) stage over 0-15 ps and the NVE (microcanonical) stage over 6-18 ps.]
303-atom Trp cage protein solvated by 2682 water molecules (8349 atoms total). LATTE simulation over 18.8 ps: NVT thermalization followed by a microcanonical (NVE) run.
Slide 15
Graph-based Electronic Structure Theory = Sparse Matrix Algebra + Divide and Conquer + Graph Theory
Slide 16
Data dependency graph S_τ: each vertex i has a core and a halo, defining a subgraph s_τ^(i).

S_τ = { s_τ^(i) }_{i=1}^{N}
H = { h[s_τ^(i)] }_{i=1}^{N}
D_τ = { lim_{n→∞} f_n(f_{n−1}(... f_0(h[s_τ^(i)]) ...)) }_{i=1}^{N}   (recursive Fermi-operator expansion)

S_τ = [Fermi Operator Expansion]_τ (global): an exact relation!

Graph-based SP2
Slide 17
Graph-based SP2 – Hybrid approach
On GPUs!
Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.
Slide 18
Graph Partitioning SP2

[Diagram: the graph of H (built from H_t and D_t-1) is partitioned; each partition defines a subgraph with base (core) and halo nodes, a dense core+halo submatrix of H/D is extracted, and the processed cores are collected into the next D_t.]

Graph partitioning options: structure-based, graph-based, hypergraph-based, community-based

Subgraph processing:
1. Determine core+halo
2. Extract submatrix
3. Run dense SP2
4. Collect into next D

• Trivial parallelism based on dense matrix algebra at BLAS3 performance
• Partitions are sets of core rows/orbitals of H
Slide 19
Subgraph Processing
For all subgraphs, in parallel: determine elements → extract submatrix → run SP2 → assemble into D_t.

• Trivial parallelism
• Same HOMO-LUMO sequence through SP2
• Dense matrix & communication-free SP2
• Small subgraphs: processed single-threaded
• Large subgraphs: processed multi-threaded
• Tunable accuracy via thresholds
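The per-subgraph procedure can be sketched as follows (a serial Python stand-in for the distributed GPU version; `core_halo`, `process_partition`, the one-step halo, and the precomputed branch sequence `seq` are illustrative simplifications, not the LATTE/BML interfaces):

```python
import numpy as np

def core_halo(adj, core):
    """Core rows plus their halo: vertices reachable from the core in the
    data-dependency graph (one step here, for simplicity)."""
    halo = set()
    for i in core:
        halo.update(np.nonzero(adj[i])[0].tolist())
    return sorted(set(core) | halo)

def process_partition(X0, seq, core, adj):
    ch = core_halo(adj, core)              # 1. determine core+halo elements
    x = X0[np.ix_(ch, ch)]                 # 2. extract the dense submatrix
    for branch in seq:                     # 3. dense SP2, replaying the same
        x2 = x @ x                         #    (global) sequence of branches
        x = x2 if branch else 2.0 * x - x2
    local = {g: k for k, g in enumerate(ch)}
    # 4. hand back the core rows for assembly into the next D
    return [(i, ch, x[local[i], :]) for i in core]
```

Because every subgraph replays the same sequence of SP2 branch decisions (the "same HOMO-LUMO sequence" above), the core rows assembled from all partitions reproduce the global density matrix up to the chosen threshold.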
Slide 20
Graph-based XL Born-Oppenheimer Molecular Dynamics (XL-BOMD)
Liquid water, (H2O)100, DFTB-LATTE
Slide 21
[Figure: ||D − D_exact||_F per atom vs. numerical threshold τ (10⁻⁸ to 10⁻⁴) for 512, 1024, and 2048 subgraphs; polyalanine in water, 20,000 atoms.]
Graph-based SP2 – Tunable accuracy
Slide 22
[Figure: SP2 run time (s) vs. number of communities (64 to 4096, log scale) for sparse-matrix algebra on 1 CPU (MKL) and graph-partitioned SP2 on 1 CPU, 1 GPU, 16 GPUs, and 32 GPUs.]
Graph-based SP2 – Distributed GPU Performance
Density matrix calculations for Polyalanine in water 20,000 atoms
• 64-4096 METIS partitions
• Single Nvidia Tesla M2090 GPU per node
• Partition core+halo sizes vary, load-balancing required
• 16,384 GPU threads, still perfect strong scaling
• Best: ~25 µs/atom
Slide 23
Graph-based SP2 – Distributed CPU vs. GPU
• 1024 and 2048 METIS partitions
• 16-256 CPU/GPU nodes
• MKL vs. cuBLAS
• Similar performance for 1024 and 2048 partitions
• Near linear scaling
• Speedup of 1.7X on GPUs
• Best: ~5 µs/atom
Density matrix calculations for Polyalanine in water 20,000 atoms
Slide 24
Basic Matrix Library (BML) & PROGRESS Library
bml C API: bml_multiply(), bml_add(), ...
bml Fortran API: call bml_multiply(), call bml_add(), ...

[Diagram: the bml library core supports dense, sparse ELLPACK, and sparse CSR matrix formats on CPU/SMP, GPGPU, and MPI backends.]

Uses BLAS Level 2 and 3 routines.
BML available at http://qmmd.github.io/bml/ under the BSD 3-clause license
Slide 25
Summary
• Distributed graph-based SP2 provides significant speedup using distributed GPU-accelerated architectures
• Available as part of the BML & PROGRESS libraries
• Allows for QMD simulations of larger systems and longer timeframes than previously possible