A Performance Portable Framework for Molecular Simulations
William Saunders*, James Grant†, Eike Hermann Mueller*
University of Bath, *Mathematical Sciences, †Chemistry
[Comp. Phys. Comms. (2018) vol 224, pp. 119-135] and [arXiv:1708.01135]
Scientific Computing Seminar, University of Warwick, Mon 18th June 2018
Overview
1 Introduction
  Molecular Simulation
  The Hardware Zoo
2 A Framework for Performance Portable Molecular Dynamics
  Abstraction
  Data structures
  Python code generation system
  Results
3 Long Range Interactions
  Ewald summation
  Fast Multipole Method
  Results
4 Conclusion
Molecular Simulation
Molecular Simulation codes are major HPC users
ARCHER top 10 codes
 1 VASP         16.8%
 2 Gromacs       8.3%
 3 CASTEP        5.0%
 4 cp2k          4.8%
 5 Q. Espresso   4.2%
 6 HYDRA         3.8%
 7 LAMMPS        3.7%
 8 OpenFOAM      3.2%
 9 NAMD          3.1%
10 WRF           2.1%
≈ 20% of time spent on molecular particle integration (and even more on quantum chemistry)
The hardware zoo
[Hardware images: Wikipedia, Flickr, CC license]
Several layers of parallelism:
  Distributed memory across nodes (MPI)
  Shared memory on a node (OpenMP, threads, MPI, Intel TBB)
  Vectorisation (CUDA, intrinsics, vector libraries)
Complex memory hierarchy
⇒ Porting MD codes is hard
Structure analysis codes are often handwritten and not parallel
Molecular Dynamics
Simulating Particles
Follow trajectories of a large number ($\gtrsim 10^6$) of interacting particles
$$m_i \frac{dv^{(i)}}{dt} = F^{(i)}(x^{(1)},\dots,x^{(N)}), \qquad \frac{dx^{(i)}}{dt} = v^{(i)} \quad \text{for } i = 1,\dots,N$$
Key operations:
  Local updates $x^{(i)} \mapsto x^{(i)} + \delta x^{(i)} = x^{(i)} + \delta t \cdot v^{(i)}$ (sketched below)
  Force calculation $F^{(i)} = F^{(i)}(x^{(1)},\dots,x^{(N)})$
  Query local environment (structure analysis)
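As an illustration of the first operation, a local update touches each particle independently; a minimal numpy sketch (array names and sizes are ours, not the framework API):

import numpy as np

# Minimal sketch of the local update x -> x + dt*v (illustrative only):
# positions and velocities stored as (N, 3) numpy arrays.
N, dt = 1000, 1.0e-3
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=(N, 3))   # positions
v = rng.normal(size=(N, 3))               # velocities

x += dt * v   # each particle updated independently -> trivially parallel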
Abstraction
“For all pairs of particles do operation $X_{\text{pair}}(i,j)$”

“For all pairs with $|r^{(i)} - r^{(j)}| < r_c$ do operation $X_{\text{local pair}}(i,j)$”
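What such a statement means is easiest to state as a naive reference loop (illustrative only; the framework never materialises this O(N²) loop, it uses cell/neighbour lists and runs in parallel):

import numpy as np

# Reference semantics of "for all pairs with |r_i - r_j| < rc do X(i, j)".
def for_all_local_pairs(r, rc, X_local_pair):
    N = len(r)
    for i in range(N):
        for j in range(N):
            if i != j and np.linalg.norm(r[i] - r[j]) < rc:
                X_local_pair(i, j)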
How this is executed is hardware specific and of no interest to the scientist (domain specialist).
Grid iteration in PDE solvers
Inspired by the (Py)OP2 library [Rathgeber et al. (2012)]
for all cells C do
  for all facets F of C do
    a_C ← a_C + ρ · b_F
  end for
end for
Python/C source code:

# Define Dats
a = op2.Dat(cells)
b = op2.Dat(facets)

# Local kernel code
kernel_code = '''
void flux_update(double *a, double **b) {
  for (int r=0; r<4; ++r)
    a[0] += rho*b[r][0];
}'''

# Define constant passed to kernel
rho = op2.Const(1, 0.3, name="rho")

# Define kernel
kernel = op2.Kernel(kernel_code)

# Define and execute par loop over cells
par_loop = op2.ParLoop(kernel, cells,
    {'a': a(op2.INC),
     'b': b(op2.READ, facet_map)})
Grid iteration/parallelisation hidden from the user!
Abstraction
Key idea: “Separation of Concerns”

Domain specialist:
  Local particle-pair kernel
  Overall algorithm
⇒ hardware independent DSL

Computational scientist:
  Framework to execute the kernel over all pairs on a particular hardware
Data structures
Each particle $i = 1,\dots,N$ can have $r = 1,\dots,M$ properties $a^{(i)}_r$
e.g.
  Mass
  Position, Velocity
  # of neighbours within distance $r_c$
Store as 2d numpy arrays wrapped in Python ParticleDat objects
Access as a.i[r] and a.j[r] in kernel $X_{\text{pair}}(i,j)$
[Diagram: a 2d array of shape M × N, indexed by particle index i and property index r]

Global properties $S_g$, e.g. energy, (binned) Radial Distribution Function (RDF)
Store as 1d numpy arrays wrapped in Python ScalarArray objects
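To make the storage layout concrete, a minimal sketch of the idea (a simplified toy version of our own, not the framework's actual classes):

import numpy as np

# Simplified sketch (our own toy version, not the framework's classes):
# a ParticleDat is essentially a named (npart, ncomp) numpy array,
# a ScalarArray a 1d array of global values.
class ParticleDat:
    def __init__(self, npart, ncomp, initial_value=0.0):
        self.data = np.full((npart, ncomp), initial_value, dtype=np.float64)

class ScalarArray:
    def __init__(self, ncomp, initial_value=0.0):
        self.data = np.full(ncomp, initial_value, dtype=np.float64)

# Property r of particle i lives at data[i, r]; inside a pair kernel the
# framework exposes the same storage as a.i[r] and a.j[r].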
Execution model
Python code-generation system:
  Domain specialist writes a small C kernel which describes $X_{\text{local pair}}(i,j)$
  Access descriptors (READ, WRITE, INC, ...)
  Auto-generated C wrapper code for kernel execution
  Necessary parallelisation calls inserted, based on access descriptors
[Diagram: the domain specialist supplies C kernels and a Python algorithm; the framework (code generation system) uses these to generate CPU+MPI code and GPU code (C), which is executed on CPUs and GPUs. Everything below the kernel/algorithm level is the computational scientist's responsibility.]
Example
Input: particle property $a$ (vector valued)

Output: particle property $b$, global property $S_g$

$$b^{(i)} = \sum_{\text{pairs }(i,j)} \left\| a^{(i)} - a^{(j)} \right\|^2 = \sum_{\text{pairs }(i,j)} \sum_{r=0}^{d-1} \left( a^{(i)}_r - a^{(j)}_r \right)^2$$

$$S_g = \sum_{\text{pairs }(i,j)} \left\| a^{(i)} - a^{(j)} \right\|^4$$
Python/C source code:

dim = 3       # dimension
npart = 1000  # number of particles

# Define ParticleDats
a = ParticleDat(npart=npart, ncomp=dim)
b = ParticleDat(npart=npart, ncomp=1, initial_value=0)
S = ScalarArray(ncomp=1, initial_value=0)

kernel_code = '''
double da_sq = 0.0;
for (int r=0; r<dim; ++r) {
  double da = a.i[r] - a.j[r];
  da_sq += da*da;
}
b.i[0] += da_sq;
S += da_sq*da_sq;
'''

# Define constants passed to kernel
consts = (Constant('dim', dim),)

# Define kernel
kernel = Kernel('update', kernel_code, consts)

# Define and execute pair loop
pair_loop = PairLoop(kernel,
                     {'a': a(access.READ),
                      'b': b(access.INC),
                      'S': S(access.INC)})
pair_loop.execute()
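For clarity, the result this pair loop computes corresponds to the following naive double loop over all ordered pairs $i \neq j$ (a reference sketch of the semantics, not the generated code):

import numpy as np

# Reference semantics of the pair loop above (naive O(N^2) sketch).
def pair_loop_reference(a):
    npart = a.shape[0]
    b = np.zeros(npart)   # access.INC on particle property b
    S = 0.0               # access.INC on global property S
    for i in range(npart):
        for j in range(npart):
            if i == j:
                continue
            da_sq = np.sum((a[i] - a[j]) ** 2)  # ||a^(i) - a^(j)||^2
            b[i] += da_sq
            S += da_sq * da_sq
    return b, S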
High-level interface
Framework structure (again)
1 High-level algorithms (timestepper, thermostat, MC sampling, ...) implemented in Python
2 Local kernel efficiently executed over all pairs

User doesn't see parallelisation
⇒ performance portability
[Same diagram as on the Execution model slide: kernels (C) and the algorithm (Python) feed the code generation framework, which generates CPU+MPI and GPU code.]
Not just a Python scripting/driver layer on top of an existing MD backend
Results I: Scalability
Strong scaling on Balena: Lennard-Jones benchmark with $10^6$ particles, compared to DL_POLY and LAMMPS on CPU and GPU
[Plots: integration time (s) and parallel efficiency (%) vs. node/GPU count (1/16 to 64 nodes; 1, 4 and 8 K20x GPUs) for Framework, Framework K20x, LAMMPS, LAMMPS K20x and DL_POLY; the secondary axis gives the number of particles per CPU core, from 1.0 · 10^6 down to 9.8 · 10^2.]
Structure analysis
Not restricted to force calculation
Bond Order Analysis [Steinhardt et al. (1983)]
$$Q^{(i)}_\ell = \sqrt{\frac{4\pi}{2\ell+1} \sum_{m=-\ell}^{+\ell} \left| q^{(i)}_{\ell m} \right|^2}, \qquad q^{(i)}_{\ell m} = \frac{1}{n_{nb}} \sum_{j=0}^{n_{nb}-1} Y^m_\ell\!\left( r^{(i)} - r^{(j)} \right)$$

Common Neighbour Analysis [Honeycutt & Andersen (1987)]
Classify pairs by the triplet $(n_{nb}, n_b, n_{lcb})$:
  $n_{nb}$: # common neighbours
  $n_b$: # neighbour links
  $n_{lcb}$: largest neighbour cluster size
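As an illustration of the bond order formula above, a direct transcription for a single particle using scipy's orthonormal spherical harmonics (our own helper, not the framework's pair kernel):

import numpy as np
from scipy.special import sph_harm

def q_l(bond_vectors, ell=6):
    """Steinhardt bond order Q_l for one particle, given the vectors
    r^(i) - r^(j) to its n_nb neighbours (illustrative transcription of the
    formula above)."""
    r = np.asarray(bond_vectors, dtype=float)
    rad = np.linalg.norm(r, axis=1)
    polar = np.arccos(r[:, 2] / rad)         # polar angle of each bond
    azimuth = np.arctan2(r[:, 1], r[:, 0])   # azimuthal angle
    # q_lm = average of Y_lm over all bond directions
    q_lm = np.array([sph_harm(m, ell, azimuth, polar).mean()
                     for m in range(-ell, ell + 1)])
    return np.sqrt(4.0 * np.pi / (2 * ell + 1) * np.sum(np.abs(q_lm) ** 2))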
Results II: Structure Analysis
On-the-fly bond order analysis
[Plots: probability density of the Q6 value (distribution of Q6), and integration time (s) vs. node count (1/16 to 4 nodes) for 3.3 · 10^4 to 2.1 · 10^6 particles (parallel scalability).]
Long Range Interactions
Hang on, electrostatics isn’t cheap!
BUT
well-defined potential ∝ 1/r
⇒ only needs to be implemented once
Long Range Interactions
Electrostatic potential:

$$\phi(r) = \sum_{j=1}^{N} \frac{q_j}{|r - r_j|}$$
Cannot truncate (consider the potential of a particle in a constant charge background)
⇒ interactions between all particle pairs: $O(N^2)$. What about periodic BCs*?
Three common approaches:
1 Ewald summation: $O(N^{3/2})$
2 Smooth Particle Mesh Ewald (SPME): $O(N \log N)$
3 Fast Multipole Method (FMM): $O(N)$
* or mirror charges in Dirichlet/Neumann BCs
Ewald summation
Split charge density and potential [Ewald (1921)]:

$$\rho(r) = \rho^{(sr)}(r) + \rho^{(lr)}(r) \quad\Rightarrow\quad \phi(r) = \phi^{(sr)}(r) + \phi^{(lr)}(r)$$

[Diagram: the point charges are split into a short range part plus a smooth long range part of width $\alpha^{-1}$.]

$\rho^{(sr)}$: exponentially screened, $\phi^{(sr)} \sim e^{-\alpha^2 r^2}$
⇒ sum directly, truncate at $r_c \sim 1/\alpha$ ⇒ $\text{Cost}^{(sr)} = C^{(sr)}(r_c) \cdot N$

$\rho^{(lr)}$: calculate in Fourier space, truncate at $k_c \sim \alpha$
⇒ $\text{Cost}^{(lr)} = C^{(lr)}(k_c) \cdot N$

Tune $\alpha(N)$, $r_c(\alpha)$ and $k_c(\alpha)$ to minimise the total cost at fixed error [Kolafa and Perram (1992)]
⇒ computational complexity $O(N^{3/2})$
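Why the tuned cost is $O(N^{3/2})$: a sketch of the standard counting argument, assuming homogeneous density $\rho$ (so $V \propto N$):

\[
\text{Cost}^{(sr)} \propto N \cdot \rho\, r_c^3 \propto N \alpha^{-3},
\qquad
\text{Cost}^{(lr)} \propto N \cdot \underbrace{k_c^3 V}_{\#\,\text{Fourier modes}} \propto \alpha^3 N^2 ,
\]
\[
\frac{d}{d\alpha}\!\left( N\alpha^{-3} + \alpha^3 N^2 \right) = 0
\;\Rightarrow\; \alpha \propto N^{-1/6}
\;\Rightarrow\; \text{Cost}^{(sr)} \sim \text{Cost}^{(lr)} \sim N^{3/2} .
\]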
Results
Computational complexity [Saunders, Grant, Mueller, arXiv:1708.01135]
[Plot: time per iteration (s) vs. particle count N (1.7 · 10^3 to 1.8 · 10^5, log-log) for Framework and DL_POLY 4, with $N^{3/2}$ and $N^1$ reference slopes.]
Results
Parallel Scalability
[Plots: time per iteration (s) and parallel efficiency (%) vs. node count (1/16 to 16 nodes) for Framework MPI+OMP, Framework MPI, DL_POLY 4, and Framework MPI+OMP with $(\alpha, r_c) = (0.023, 22.1\,\text{Å})$; the secondary axis gives particles per CPU core, from 3.3 · 10^4 down to 1.3 · 10^2.]
Fast Multipole Method
The exact structure of a cluster of charges becomes less important when observed from larger distances.
Similar effect with images
Multipole expansion
Expansion in spherical harmonics:

$Y_0^0/r$
$Y_1^{\{-1,0,1\}}/r^2$
$Y_2^{\{-2,-1,0,1,2\}}/r^3$
https://upload.wikimedia.org/wikipedia/commons/6/62/Spherical_Harmonics.png
Fast Multipole Method
Fast Multipole Method [Greengard and Rokhlin (1987)]
$$\phi(r,\theta,\varphi) = \sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{+\ell} a_{\ell m}\, \frac{Y_{\ell m}(\theta,\varphi)}{r^{\ell+1}} \qquad (r > R)$$

$$\phi(r,\theta,\varphi) = \sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{+\ell} b_{\ell m}\, Y_{\ell m}(\theta,\varphi)\, r^{\ell} \qquad (r \leq R)$$
Upward pass: multipole expansion on mesh hierarchy
Downward pass: increment local expansion
Evaluation of local expansion

⇒ computational complexity $O(N)$
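To make the outer expansion concrete, a self-contained numpy/scipy sketch that builds the coefficients $a_{\ell m}$ for a cluster of charges and evaluates $\phi$ at a distant point (the conventions, orthonormal $Y_{\ell m}$, and all helper names are our own assumptions; this is not the framework's FMM code):

import numpy as np
from scipy.special import sph_harm

def to_angles(r):
    r = np.asarray(r, dtype=float)
    rad = np.linalg.norm(r, axis=-1)
    polar = np.arccos(r[..., 2] / rad)
    azimuth = np.arctan2(r[..., 1], r[..., 0])
    return rad, polar, azimuth

def multipole_coeffs(q, pos, lmax):
    # a_lm = 4pi/(2l+1) * sum_j q_j r_j^l conj(Y_lm), from the addition theorem
    rad, pol, azi = to_angles(pos)
    return {(l, m): 4.0 * np.pi / (2 * l + 1)
            * np.sum(q * rad**l * np.conj(sph_harm(m, l, azi, pol)))
            for l in range(lmax + 1) for m in range(-l, l + 1)}

def eval_expansion(a, r, lmax):
    # phi(r) = sum_lm a_lm Y_lm(theta, phi) / r^(l+1), valid for |r| > R
    rad, pol, azi = to_angles(r)
    return sum(a[(l, m)] * sph_harm(m, l, azi, pol) / rad**(l + 1)
               for l in range(lmax + 1) for m in range(-l, l + 1)).real

rng = np.random.default_rng(0)
q = rng.uniform(-1.0, 1.0, 20)           # 20 point charges ...
pos = rng.uniform(-0.5, 0.5, (20, 3))    # ... in a cluster of radius R < 1
r = np.array([4.0, 1.0, 2.0])            # distant observation point, |r| >> R

a = multipole_coeffs(q, pos, lmax=8)
direct = np.sum(q / np.linalg.norm(r - pos, axis=1))
print(eval_expansion(a, r, lmax=8), direct)  # agree up to O((R/|r|)^(lmax+1))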
Fast Multipole Method
[Diagram: the upward pass forms multipole expansions on the finest level and shifts them multipole→multipole up the hierarchy, with direct summation of (far) periodic images at the coarsest level; the downward pass shifts local←local down the hierarchy, adding well-separated contributions on each level, and direct summation handles the (near) periodic images at the finest level.]
Setup
DL_POLY NaCl benchmark
Electrostatics + repulsive LJ interactions (to prevent collapse)
Charges arranged in a simple cubic lattice, a = 3.3 Å, random initial velocities
Periodic boundary conditions
Constant density ρ = (3.3 Å)^{-3}
Running 200 iterations, Velocity Verlet (no thermostat)
Particles wiggle around their initial positions
Accuracy: error on total potential energy δE/E ≤ 10^{-5}
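A minimal sketch of this initial condition (our own reading of the setup; the system size n, array names, and unit conventions are illustrative assumptions):

import numpy as np

# Sketch of the benchmark's initial state: NaCl-like alternating charges on a
# simple cubic lattice with constant a = 3.3 Angstrom (illustrative only).
a = 3.3          # lattice constant [Angstrom]
n = 10           # unit cells per direction (assumed); N = n^3 particles
idx = np.indices((n, n, n)).reshape(3, -1).T
positions = a * idx.astype(float)
# Alternate +1/-1 charges with the parity of i+j+k (rock-salt pattern).
charges = np.where(idx.sum(axis=1) % 2 == 0, 1.0, -1.0)
velocities = np.random.default_rng(42).normal(size=positions.shape)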
Results
Weak scaling ($10^6 \to 1.6 \cdot 10^7$ particles, 16 → 256 cores)
[Plot: time per iteration (s) vs. core count (16 to 256) for PPMD on Ivy Bridge vs. ideal scaling; weak scaling with 62,500 particles per core.]
⇒ Confirms $O(N)$ computational complexity
Results
Strong scaling (62,500 → 3,906 particles/core, 16 → 256 cores)
[Plot: time per iteration (s) vs. core count (16 to 256, log-log) for PPMD (FMM), PPMD (FMM) MPI+OMP and DL_POLY (SPME) vs. ideal scaling; strong scaling with 10^6 particles.]
Conclusion
Summary ... [Saunders, Grant, Mueller (2018), Comp. Phys. Comms. vol 224, pp. 119-135 and arXiv:1708.01135]
Performance portable (MPI, CUDA, CUDA+MPI, ...)
Code generation framework based on “Separation of Concerns”
Speed/scalability comparable to established MD codes
Arbitrary pair kernels, not just force calculation
Long range electrostatic interactions (Ewald, Fast Multipole, ...), currently CPU-only
... and Outlook

GPU offload of FMM (in progress)
Multiple species
Constraints (molecules)