Slide 1 U N C L A S S I F I E D
Distributed Graph-based Density Matrix Calculation for Quantum
Molecular Dynamics using GPUs
April 4-7, 2016
2016 GPU Technology Conference
S. M. Mniszewski, C. F. A. Negre, M. J. Cawkwell, A. M. N. Niklasson Los Alamos National Laboratory
Slide 2
Why Molecular Dynamics?
Slide 3
Background
• In molecular dynamics simulation, the relative positions of atoms evolve over a series of time steps according to the force acting on each atom
• Employed in materials science, chemistry, and biology to study structures, defects, and equilibrium and non-equilibrium phenomena
• Dependence on an interatomic potential to calculate forces and energy
• Quantum-based models capture the making and breaking of covalent bonds, charge transfer between species of differing electronegativities, and long-range electrostatic interactions, i.e., chemical reactions
Slide 4
Integrate the equations of motion of classical molecular trajectories
with the forces calculated on the fly from a self-consistent quantum mechanical description of the electronic structure:
M_I R̈_I = −∂U(R; ρ_sc) / ∂R_I
H[ρ] φ_i = ε_i φ_i
ρ = Σ_occ |φ_i|² → ρ_sc
U(R; ρ) → U(R; ρ_sc)   (iterated to self-consistency, SCF)
Born-Oppenheimer MD
Cost per MD step: #SCF × O(N³)
Quantum Molecular Dynamics (QMD)
Slide 5
Example: Biosynthesis of Histidine
• What is responsible for the biosynthesis of histidine?
• Present in several pathogenic bacteria
• Allosteric mechanism was determined with MD simulation
• Several thousand atoms over timescales of 100s of ns!
Rivalta I, Sultan MM, Lee N-S, Manley GA, Loria JP, Batista VS. PNAS, 2012 109(22), E1428-E1436.
Slide 6
Future QMD Simulations
IAPP dimerization and type-2 diabetes (537 atoms).
Beta Amyloid Peptide and Alzheimer’s Disease (410 atoms).
Raffa DF, Rauk A, J. Phys. Chem. B, 2007, 111 (14), 3789-99. Dupuis NF, Wu C, Shea J-E, Bowers MT, J. Am. Chem. Soc., 2011, 133 (19), 7240-43.
Slide 7
Computational Cost
• High computational cost and complexity of QMD calculations
• The density matrix construction is the most expensive part of each MD timestep
• The second order spectral projection (SP2) algorithm breakthrough
• Use of hybrid parallelism on GPU-accelerated clusters
[Figure: wall-clock time per MD time step vs. system size N for O(N³) diagonalization, O(N) regular linear scaling, and O(N) low pre-factor methods. Significant pre-factor reduction is required for practical large-scale QMD!]
Quantum Molecular Dynamics
Slide 8
The Density Matrix Computation
• Typically, algorithms used in quantum-based models, most notably matrix diagonalization, are not ideally suited to GPUs
  – Due to their complexity
  – Difficulty in extracting thread-level parallelism
  – Difficulty of avoiding branching within warps
• New SP2 approach
  – Computed directly from the Hamiltonian through a recursive expansion of the Fermi operator with the second order spectral projection (SP2) algorithm
  – Based on a series of generalized matrix-matrix multiplications
  – Only one matrix-matrix multiplication is required per iteration
  – Maps very well to GPUs
Slide 9
[Figure: the SP2 projection polynomials f(x) = x² and f(x) = 2x − x² on [0, 1]; the composition f8(f7(...f1(x)...)) approaches a step function.]
The Second Order Spectral Projection Algorithm (SP2) – Reduced Complexity
ρ = θ[μI − H] = lim_{i→∞} f_i[f_{i−1}[... f_0[X_0] ...]]

X_0 = (ε_max I − H) / (ε_max − ε_min)

f_i[X_i] = X_i²           if 2 Tr[X_i] ≥ N_e
f_i[X_i] = 2X_i − X_i²    if 2 Tr[X_i] < N_e

Recursive Fermi operator expansion
Niklasson AMN, Phys. Rev. B 66, 155115 (2002).
Slide 10
SP2 Algorithm Using the GPU Approach
Estimate εmax and εmin
X = (εmax·I − H) / (εmax − εmin)
TraceX = Tr[X]                          /* trace kernel on GPU */
Until converged do
    Xtmp = X
    Xtmp = Xtmp − X·X                   /* cuBLAS xGEMM (α = −1, β = 1), i.e. Xtmp = X − X² */
    TraceXtmp = Tr[Xtmp]                /* trace kernel on GPU */
    if |2·TraceX − 2·TraceXtmp − Ne| > |2·TraceX + 2·TraceXtmp − Ne|
        X = X + Xtmp                    /* cuBLAS xAXPY: X = 2X − X² */
        TraceX = TraceX + TraceXtmp
    else
        X = X − Xtmp                    /* cuBLAS xAXPY: X = X² */
        TraceX = TraceX − TraceXtmp
end until
ρ = X
Cawkwell MJ, Mniszewski SM, Niklasson AMN, Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE, GTC 2013.
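The loop above can be transcribed into a few lines of NumPy for illustration (a minimal sketch: the cuBLAS calls and the GPU trace kernel become NumPy operations, the simpler branch test from the previous slide is used, and the 3×3 Hamiltonian and electron count Ne are made-up values):

```python
import numpy as np

# Toy symmetric "Hamiltonian" and electron count -- illustration only.
H = np.array([[0.0, 0.4, 0.0],
              [0.4, 0.2, 0.3],
              [0.0, 0.3, 0.8]])
Ne = 2.0                                   # two electrons -> Tr[rho] = 1

e = np.linalg.eigvalsh(H)                  # here we diagonalize for the
e_min, e_max = e[0], e[-1]                 # bounds; in practice they are
                                           # estimated, e.g. Gershgorin circles
X = (e_max * np.eye(3) - H) / (e_max - e_min)   # spectrum mapped into [0, 1]

for _ in range(25):                        # fixed count instead of a
    X2 = X @ X                             # convergence test, for brevity
    if 2.0 * np.trace(X) >= Ne:
        X = X2                             # f(x) = x^2: push toward 0
    else:
        X = 2.0 * X - X2                   # f(x) = 2x - x^2: push toward 1

rho = X                                    # idempotent density matrix
```

For this toy Hamiltonian the recursion converges to an idempotent ρ with Tr[ρ] = Ne/2, with only one matrix-matrix multiplication per iteration.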
Slide 11
Density Matrix Calculation (Nvidia M2090) – Liquid Methane (10 – 1250 molecules)
[Figure: time per density matrix build (s) vs. matrix dimension (up to 10,000) for SP2 on 1, 2, and 3 GPUs, SP2 on the CPU, and diagonalization.]
Slide 12
Sparse Matrix SP2 – ELLPACK-R Format
• Described by 3 arrays, 2-D values and indices, 1-D non-zero entries per row
• N rows and M max non-zeroes per row; O(NM²) computational complexity
• Row-wise storage for parallelism opportunities
• No insertion cost compared to CSR
[Figure: a dense 0/1 example matrix and its sparse ELLPACK-R representation: 2-D Values and Columns arrays padded to the maximum row fill, plus a 1-D #Non-zeroes-per-row array.]
Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654.
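As a concrete sketch of the three-array layout (illustrative Python; the function names are made up, not the implementation from the paper):

```python
import numpy as np

# ELLPACK-R storage: 2-D `values` and `columns` arrays padded to the
# maximum row fill M, plus a 1-D count of non-zeros per row.
def dense_to_ellpack_r(A, threshold=0.0):
    n = A.shape[0]
    rows = [np.nonzero(np.abs(A[i]) > threshold)[0] for i in range(n)]
    m = max(len(r) for r in rows)          # M: max non-zeros in any row
    values = np.zeros((n, m))
    columns = np.zeros((n, m), dtype=int)
    nnz = np.zeros(n, dtype=int)
    for i, c in enumerate(rows):
        nnz[i] = len(c)
        columns[i, :len(c)] = c
        values[i, :len(c)] = A[i, c]
    return values, columns, nnz

# Row-wise access pattern: the thread handling row i walks exactly
# nnz[i] entries, which is what makes the format GPU-friendly.
def ellpack_r_matvec(values, columns, nnz, x):
    y = np.zeros(len(nnz))
    for i in range(len(nnz)):
        for k in range(nnz[i]):
            y[i] += values[i, k] * x[columns[i, k]]
    return y
```

Because each row stores its own non-zero count, appending an element to a row costs no data movement in other rows, unlike the shifting required by CSR insertion.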
Slide 13
SP2 Shared Memory – Significant Cost Reduction
[Figure: density matrix construction time (s) vs. number of atoms (up to 6000) for polyethylene: serial and 16-thread diagonalization vs. SP2 with CSR (serial) and ELLPACK at 1, 4, and 16 threads.]
Slide 14
Shared Memory SP2 – TRP Cage Protein
[Figure: total energy (eV) vs. time (ps) for the NVT (thermalization) stage over 0-15 ps and the NVE (microcanonical) stage over 6-18 ps.]
303-atom Trp cage protein solvated by 2682 water molecules (8349 atoms total). LATTE simulation over 18.8 ps: NVT thermalization followed by a microcanonical (NVE) run.
Slide 15
Graph-based Electronic Structure Theory = Sparse Matrix Algebra + Divide and Conquer + Graph Theory
Slide 16
Data dependency graph S_τ: each vertex i has a core and a halo, defining a subgraph s_τ^(i).

S_τ = { s_τ^(i) }_{i=1}^{N}
H = { h[s_τ^(i)] }_{i=1}^{N}
D_τ = { lim_{n→∞} f_n(f_{n−1}(... f_0(h[s_τ^(i)]) ...)) }_{i=1}^{N}   (recursive Fermi-operator expansion)

S_τ = [Fermi Operator Expansion]_τ (global): an exact relation!

Graph-based SP2
Slide 17
Graph-based SP2 – Hybrid approach
On GPUs!
Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.
Slide 18
Graph Partitioning SP2

[Diagram: the graph of H (built from H_t and D_t-1) is partitioned; each partition defines a subgraph with base (core) and halo nodes, a dense core+halo submatrix of H/D is extracted, and the processed cores are collected into the next D_t.]

Graph partitioning options: structure-based, graph-based, hypergraph-based, community-based

Subgraph processing:
1. Determine core+halo
2. Extract submatrix
3. Run dense SP2
4. Collect into next D

• Trivial parallelism based on dense matrix algebra at BLAS3 performance
• Partitions are sets of core rows/orbitals of H
Slide 19
Subgraph Processing
For all subgraphs, in parallel: determine elements → extract submatrix → run SP2 → assemble into D_t.

• Trivial parallelism
• Same HOMO-LUMO sequence through SP2
• Dense matrix & communication-free SP2
• Small subgraphs: processed single-threaded
• Large subgraphs: processed multi-threaded
• Tunable accuracy via thresholds
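The per-subgraph procedure can be sketched as follows (a serial Python stand-in for the distributed GPU version; `core_halo`, `process_partition`, the one-step halo, and the precomputed branch sequence `seq` are illustrative simplifications, not the LATTE/BML interfaces):

```python
import numpy as np

def core_halo(adj, core):
    """Core rows plus their halo: vertices reachable from the core in the
    data-dependency graph (one step here, for simplicity)."""
    halo = set()
    for i in core:
        halo.update(np.nonzero(adj[i])[0].tolist())
    return sorted(set(core) | halo)

def process_partition(X0, seq, core, adj):
    ch = core_halo(adj, core)              # 1. determine core+halo elements
    x = X0[np.ix_(ch, ch)]                 # 2. extract the dense submatrix
    for branch in seq:                     # 3. dense SP2, replaying the same
        x2 = x @ x                         #    (global) sequence of branches
        x = x2 if branch else 2.0 * x - x2
    local = {g: k for k, g in enumerate(ch)}
    # 4. hand back the core rows for assembly into the next D
    return [(i, ch, x[local[i], :]) for i in core]
```

Because every subgraph replays the same sequence of SP2 branch decisions (the "same HOMO-LUMO sequence" above), the core rows assembled from all partitions reproduce the global density matrix up to the chosen threshold.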
Slide 20
Graph-based XL Born-Oppenheimer Molecular Dynamics (XL-BOMD)
Liquid water, (H2O)100, DFTB-LATTE
Slide 21
[Figure: ||D − D_exact||_F per atom vs. numerical threshold τ (10⁻⁸ to 10⁻⁴) for 512, 1024, and 2048 subgraphs; polyalanine in water, 20,000 atoms.]
Graph-based SP2 – Tunable accuracy
Slide 22
[Figure: SP2 run time (s) vs. number of communities (64 to 4096, log scale) for sparse-matrix algebra on 1 CPU (MKL) and graph-partitioned SP2 on 1 CPU, 1 GPU, 16 GPUs, and 32 GPUs.]
Graph-based SP2 – Distributed GPU Performance
Density matrix calculations for Polyalanine in water 20,000 atoms
• 64-4096 METIS partitions
• Single Nvidia Tesla M2090 GPU per node
• Partition core+halo sizes vary, load-balancing required
• 16,384 GPU threads, still perfect strong scaling
• Best: ~25 µs/atom
Slide 23
Graph-based SP2 – Distributed CPU vs. GPU
• 1024 and 2048 METIS partitions
• 16-256 CPU/GPU nodes
• MKL vs. cuBLAS
• Similar performance for 1024 and 2048 partitions
• Near linear scaling
• Speedup of 1.7X on GPUs
• Best: ~5 µs/atom
Density matrix calculations for Polyalanine in water 20,000 atoms
Slide 24
Basic Matrix Library (BML) & PROGRESS Library
bml C API: bml_multiply(), bml_add(), ...
bml Fortran API: call bml_multiply(), call bml_add(), ...

[Diagram: the bml library core supports dense, sparse ELLPACK, and sparse CSR matrix formats on CPU/SMP, GPGPU, and MPI backends.]

Uses BLAS Level 2 and 3 routines.
BML available at http://qmmd.github.io/bml/ under the BSD 3-clause license
Slide 25
Summary
• Distributed graph-based SP2 provides significant speedup using distributed GPU-accelerated architectures
• Available as part of the BML & PROGRESS libraries
• Allows for QMD simulations of larger systems and longer timeframes than previously possible