
A Performance Portable Framework for Molecular Simulations


Page 1: A Performance Portable Framework for Molecular Simulations


A Performance Portable Framework for Molecular Simulations

William Saunders*, James Grant†, Eike Hermann Mueller*

University of Bath (*Mathematical Sciences, †Chemistry)

[Comp. Phys. Comms. (2018) vol 224, pp. 119-135] and [arXiv:1708.01135]

Scientific Computing Seminar, University of Warwick, Mon 18th June 2018


Page 2: A Performance Portable Framework for Molecular Simulations


Overview

1 Introduction: Molecular Simulation; The Hardware Zoo

2 A Framework for Performance Portable Molecular Dynamics: Abstraction; Data structures; Python code generation system; Results

3 Long Range Interactions: Ewald summation; Fast Multipole Method; Results

4 Conclusion


Page 3: A Performance Portable Framework for Molecular Simulations


Molecular Simulation

Molecular Simulation codes are major HPC users

ARCHER top 10 codes

1  VASP         16.8%
2  Gromacs       8.3%
3  CASTEP        5.0%
4  cp2k          4.8%
5  Q. Espresso   4.2%
6  HYDRA         3.8%
7  LAMMPS        3.7%
8  OpenFOAM      3.2%
9  NAMD          3.1%
10 WRF           2.1%

≈ 20% of time spent on molecular particle integration (and even more on quantum chemistry)


Page 4: A Performance Portable Framework for Molecular Simulations


The hardware zoo

Source: Wikipedia, Flickr, CC license

Several layers of parallelism:
  Distributed memory across nodes (MPI)
  Shared memory on a node (OpenMP, threads, MPI, Intel TBB)
  Vectorisation (CUDA, intrinsics, vector libraries)

Complex memory hierarchy

⇒ Porting MD codes is hard

Structure analysis codes often handwritten and not parallel


Page 5: A Performance Portable Framework for Molecular Simulations


Molecular Dynamics

Simulating Particles

Follow trajectories of a large number (≳ 10^6) of interacting particles

$m_i \frac{dv^{(i)}}{dt} = F^{(i)}\left(x^{(1)}, \dots, x^{(N)}\right), \qquad \frac{dx^{(i)}}{dt} = v^{(i)} \qquad \text{for } i = 1, \dots, N$

Key operations

Local updates x^(i) ↦ x^(i) + δx^(i) = x^(i) + δt · v^(i)

Force calculation F^(i) = F^(i)(x^(1), …, x^(N))

Query local environment (structure analysis)
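To make these key operations concrete, here is a minimal plain-numpy sketch of one velocity Verlet step (the integrator used for the benchmarks later in the talk), built only from local per-particle updates and one force calculation; this is generic MD bookkeeping, not code from the framework, and force_fn is a placeholder.

import numpy as np

def velocity_verlet_step(x, v, f, mass, dt, force_fn):
    # One velocity Verlet step, expressed via the two key operations:
    # local per-particle updates and a single force calculation.
    v = v + 0.5 * dt * f / mass      # local update: half kick
    x = x + dt * v                   # local update: drift
    f = force_fn(x)                  # force calculation F^(i)(x^(1), ..., x^(N))
    v = v + 0.5 * dt * f / mass      # local update: half kick
    return x, v, f

# Toy usage with a stand-in harmonic force (not a pair potential).
N = 1000
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, (N, 3))
v = np.zeros((N, 3))
mass = np.ones((N, 1))
force_fn = lambda pos: -0.1 * (pos - 5.0)
f = force_fn(x)
for _ in range(10):
    x, v, f = velocity_verlet_step(x, v, f, mass, 0.01, force_fn)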


Page 6: A Performance Portable Framework for Molecular Simulations


Abstraction

“For all pairs of particles do operation X_pair(i, j)”


“For all pairs with |r^(i) − r^(j)| < r_c do operation X_pair^local(i, j)”

How this is executed is

Hardware specific

Of no interest to the scientist (domain specialist)
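For reference, the abstraction amounts to a doubly nested loop like the following plain Python sketch (O(N^2), no cell lists, neighbour lists or parallelisation); how to execute it efficiently on a given machine is precisely what the framework hides. The helper names here are made up for illustration.

import numpy as np

def pair_loop(positions, r_c, x_pair):
    # Naive "for all pairs with |r_i - r_j| < r_c apply x_pair(i, j)".
    n = positions.shape[0]
    for i in range(n):
        for j in range(n):
            if i != j and np.linalg.norm(positions[i] - positions[j]) < r_c:
                x_pair(i, j)

# Example: count, for each particle, the neighbours within the cutoff.
rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 5.0, (200, 3))
counts = np.zeros(200, dtype=int)

def count_neighbour(i, j):
    counts[i] += 1

pair_loop(pos, r_c=1.0, x_pair=count_neighbour)
print(counts[:10])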


Page 7: A Performance Portable Framework for Molecular Simulations


Grid iteration in PDE solvers

Inspired by the (Py)OP2 library [Rathgeber et al. (2012)]


for all cells C do
  for all facets F of C do
    a_C ↦ a_C + ρ · b_F
  end for
end for

Python/C Source code

# Define Dats
a = op2.Dat(cells)
b = op2.Dat(facets)

# Local kernel code
kernel_code='''
void flux_update(double *a,
                 double **b) {
  for (int r=0;r<4;++r)
    a[0] += rho*b[r][0];
}'''

# Define constant passed to kernel
rho = op2.Const(1, 0.3, name="rho")

# Define kernel
kernel = op2.Kernel(kernel_code)

# Define and execute par loop
par_loop = op2.ParLoop(kernel, cells,
                       {'a': a(op2.INC),
                        'b': b(op2.READ, facet_map)})

Grid iteration/parallelisation hidden from user!

Page 8: A Performance Portable Framework for Molecular Simulations


Abstraction

Key idea: “Separation of Concerns”

Domain specialist:
  Local particle-pair kernel
  Overall algorithm
  ⇒ hardware independent DSL

Computational scientist:
  Framework to execute the kernel over all pairs on a particular hardware


Page 9: A Performance Portable Framework for Molecular Simulations


Data structures

Each particle i = 1, …, N can have r = 1, …, M properties a_r^(i)

e.g. Mass; Position, Velocity; # of neighbours within distance r_c

Store as 2d numpy arrays wrapped in Python ParticleDat objects
Access as a.i[r] and a.j[r] in kernel X_pair(i, j)

[Figure: 2d array layout, particle index i along one axis, property index r along the other]

Global properties S_g, e.g.
  Energy
  (binned) Radial Distribution Function (RDF)

Store as 1d numpy arrays wrapped in Python ScalarArray objects
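To illustrate the data layout only (a simplified stand-in, not the framework's actual classes), a ParticleDat is essentially an (npart, ncomp) numpy array and a ScalarArray a small 1d array:

import numpy as np

class ParticleDat:
    # Simplified stand-in: per-particle properties as an (npart, ncomp) array.
    def __init__(self, npart, ncomp, initial_value=0.0):
        self.data = np.full((npart, ncomp), initial_value, dtype=np.float64)

class ScalarArray:
    # Simplified stand-in: global properties as a 1d array of length ncomp.
    def __init__(self, ncomp, initial_value=0.0):
        self.data = np.full(ncomp, initial_value, dtype=np.float64)

N = 1000
x = ParticleDat(npart=N, ncomp=3)                      # positions
v = ParticleDat(npart=N, ncomp=3)                      # velocities
m = ParticleDat(npart=N, ncomp=1, initial_value=1.0)   # masses
rdf = ScalarArray(ncomp=10)                            # 10-bin RDF histogram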


Page 10: A Performance Portable Framework for Molecular Simulations


Execution model

Python Code-generation system

Domain specialist writes small C kernel which describes X_pair^local(i, j)

Access descriptors (READ, WRITE, INC, …)

Auto-generated C wrapper code for kernel execution

Necessary parallelisation calls inserted, based on access descriptors

[Diagram: the domain specialist supplies the algorithm (Python) and local kernels (C); the framework (code generation system), maintained by the computational scientist, generates CPU+MPI and GPU code from them and executes it on the target hardware.]


Page 11: A Performance Portable Framework for Molecular Simulations


Example

Input: particle property a (vector valued)

Output: particle property b, global property S_g

$b^{(i)} = \sum_{\text{pairs }(i,j)} \left\| a^{(i)} - a^{(j)} \right\|^2 = \sum_{\text{pairs }(i,j)} \sum_{r=0}^{d-1} \left( a_r^{(i)} - a_r^{(j)} \right)^2$

$S_g = \sum_{\text{pairs }(i,j)} \left\| a^{(i)} - a^{(j)} \right\|^4$

Python/C Source code

dim = 3      # dimension
npart = 1000 # number of particles

# Define Particle Dats
a = ParticleDat(npart=npart, ncomp=dim)
b = ParticleDat(npart=npart, ncomp=1,
                initial_value=0)
S = ScalarArray(ncomp=1, initial_value=0)

kernel_code='''
double da_sq = 0.0;
for (int r=0;r<dim;++r) {
  double da = a.i[r]-a.j[r];
  da_sq += da*da;
}
b.i[0] += da_sq; S += da_sq*da_sq;
'''

# Define constants passed to kernel
consts = (Constant('dim', dim),)

# Define kernel
kernel = Kernel('update', kernel_code, consts)

# Define and execute pair loop
pair_loop = PairLoop(kernel=kernel,
                     dat_dict={'a': a(access.READ),
                               'b': b(access.INC),
                               'S': S(access.INC)})
pair_loop.execute()
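For reference only, the same quantities can be computed directly with numpy, assuming the kernel fires once for every ordered pair (i, j) with i ≠ j and no cutoff is applied; this shows what the pair loop computes, not how the framework executes it.

import numpy as np

npart, dim = 1000, 3
rng = np.random.default_rng(42)
a = rng.standard_normal((npart, dim))

diff = a[:, None, :] - a[None, :, :]   # (npart, npart, dim) pairwise differences
da_sq = np.sum(diff**2, axis=-1)       # ||a_i - a_j||^2 for every ordered pair
b = np.sum(da_sq, axis=1)              # b_i = sum_j ||a_i - a_j||^2 (i == j adds 0)
S = np.sum(da_sq**2)                   # S = sum over ordered pairs of ||a_i - a_j||^4

print(b[:3], S)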


Page 12: A Performance Portable Framework for Molecular Simulations


High-level interface

Framework structure (again)

1 High-level algorithms (timestepper, thermostat, MC sampling, …) implemented in Python

2 Local kernel efficiently executed over all pairs

User doesn’t see parallelisation

⇒ performance portability

[Diagram (as before): Python algorithm and C kernels are fed to the code generation framework, which generates CPU+MPI and GPU code and executes it; the domain specialist writes the former, the computational scientist maintains the latter.]

Not just a Python scripting driver layer for existing MD backend


Page 13: A Performance Portable Framework for Molecular Simulations


Results I: Scalability

Strong scaling on Balena: Lennard-Jones benchmark, 10^6 particles, compared to DL_POLY and LAMMPS on CPU and GPU

[Plots: solution time in seconds (left) and parallel efficiency in % (right) vs. node/GPU count (1/16 to 64 nodes; 1, 4 and 8 K20x GPUs) for the Framework, Framework on K20x, LAMMPS, LAMMPS on K20x and DL_POLY; the secondary axis gives the number of particles per CPU core, from 1.0·10^6 down to 9.8·10^2.]


Page 14: A Performance Portable Framework for Molecular Simulations


Structure analysis

Not restricted to force calculation

Bond Order Analysis [Steinhardt et al. (1983)]

$Q_\ell^{(i)} = \sqrt{\frac{4\pi}{2\ell+1} \sum_{m=-\ell}^{+\ell} \left| q_{\ell m}^{(i)} \right|^2}, \qquad q_{\ell m}^{(i)} = \frac{1}{n_{nb}} \sum_{j=0}^{n_{nb}-1} Y_\ell^m\left( r^{(i)} - r^{(j)} \right)$

Common Neighbour Analysis [Honeycutt & Andersen (1987)]

Classify pairs by the triplet (n_nb, n_b, n_lcb):
  n_nb: # common neighbours
  n_b:  # neighbour links
  n_lcb: neighbour cluster size
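A per-particle sketch of Q_ℓ using scipy (assuming the bond vectors r^(i) − r^(j) to the n_nb neighbours within r_c have already been gathered; inside the framework this calculation would sit in a pair-loop kernel instead):

import numpy as np
from scipy.special import sph_harm

def q_lm(bond_vectors, l):
    # q_{lm}: average of Y_l^m over the directions of the bond vectors.
    x, y, z = bond_vectors[:, 0], bond_vectors[:, 1], bond_vectors[:, 2]
    r = np.linalg.norm(bond_vectors, axis=1)
    theta = np.arctan2(y, x) % (2.0 * np.pi)        # azimuthal angle
    phi = np.arccos(np.clip(z / r, -1.0, 1.0))      # polar angle
    return np.array([np.mean(sph_harm(m, l, theta, phi)) for m in range(-l, l + 1)])

def Q_l(bond_vectors, l=6):
    # Steinhardt bond order parameter for a single particle.
    q = q_lm(bond_vectors, l)
    return np.sqrt(4.0 * np.pi / (2 * l + 1) * np.sum(np.abs(q) ** 2))

# Example: 12 neighbours in random directions.
rng = np.random.default_rng(1)
print(Q_l(rng.standard_normal((12, 3)), l=6))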


Page 15: A Performance Portable Framework for Molecular Simulations


Results II: Structure Analysis

On-the-fly Bond order analysis

[Plots: distribution of the Q6 value (probability density vs. Q6, left) and parallel scalability of the on-the-fly analysis (integration time in seconds vs. node count, 1/16 to 4 nodes, for 3.3·10^4 to 2.1·10^6 particles, right).]


Page 16: A Performance Portable Framework for Molecular Simulations


Long range Interactions

Hang on, electrostatics isn’t cheap!

BUT

well-defined potential ∝ 1/r

⇒ only needs to be implemented once


Page 17: A Performance Portable Framework for Molecular Simulations


Long Range Interactions

Electrostatic potential

$\phi(r) = \sum_{j=1}^{N} \frac{q_j}{|r - r_j|}$

Cannot truncate (consider the potential of a particle in a constant charge background)

⇒ Interactions between all particle pairs: O(N^2)

What about periodic BCs*?

Three common approaches:
1 Ewald summation: O(N^{3/2})
2 Smooth Particle Mesh Ewald (SPME): O(N log N)
3 Fast Multipole Method (FMM): O(N)

*or mirror charges in Dirichlet/Neumann BCs

Page 18: A Performance Portable Framework for Molecular Simulations


Ewald summation

Split charge density and potential [Ewald (1921)]:

$\rho(r) = \rho^{(sr)}(r) + \rho^{(lr)}(r) \;\Rightarrow\; \phi(r) = \phi^{(sr)}(r) + \phi^{(lr)}(r)$

[Figure: point charge distribution split into a short range part, screened by Gaussians of width α^{-1}, plus a smooth long range part]

$\rho^{(sr)}$: exponentially screened, $\phi^{(sr)} \sim e^{-\alpha^2 r^2}$
  ⇒ sum directly, truncate at $r_c \sim 1/\alpha$ ⇒ $\mathrm{Cost}^{(sr)} = C^{(sr)}(r_c) \cdot N$

$\rho^{(lr)}$: calculate in Fourier space, truncate at $k_c \sim \alpha$
  ⇒ $\mathrm{Cost}^{(lr)} = C^{(lr)}(k_c) \cdot N$

Tune α(N), r_c(α) and k_c(α) to minimise the total cost at fixed error [Kolafa and Perram (1992)] ⇒ computational complexity O(N^{3/2})
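One standard back-of-the-envelope way to see the N^{3/2} scaling, at fixed density ρ = N/V (this argument is not spelled out on the slide): each particle has ∼ ρ r_c^3 short-range neighbours, and the reciprocal-space sum runs over ∼ (k_c L)^3 wave vectors, each requiring an O(N) structure factor, so

$\mathrm{Cost}(\alpha) \sim A\,N\,\rho\,r_c^3 + B\,N\,(k_c L)^3 \sim A\,\frac{N\rho}{\alpha^3} + B\,\frac{N^2 \alpha^3}{\rho}, \qquad r_c \sim 1/\alpha,\; k_c \sim \alpha,\; L^3 = N/\rho.$

Minimising over α gives $\alpha \propto N^{-1/6}$, and both terms then scale as $N \cdot N^{1/2} = \mathcal{O}(N^{3/2})$.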


Page 19: A Performance Portable Framework for Molecular Simulations


Results

Computational Complexity [Saunders, Grant, Mueller, arXiv:1708.01135]

[Plot: time per iteration in seconds vs. particle count N (1.7·10^3 to 1.8·10^5) for the Framework and DL_POLY 4, with N^{3/2} and N^1 guide lines.]


Page 20: A Performance Portable Framework for Molecular Simulations


Results

Parallel Scalability

[Plots: time per iteration in seconds (left) and parallel efficiency in % (right) vs. node count (1/16 to 16) for Framework MPI+OMP, Framework MPI, DL_POLY 4 and Framework MPI+OMP with (α, r_c) = (0.023, 22.1 Å); the secondary axis gives the number of particles per CPU core, from 3.3·10^4 down to 1.3·10^2.]


Page 21: A Performance Portable Framework for Molecular Simulations


Fast Multipole Method

The exact structure of a cluster of charges becomes less important when observed from a greater distance.

Similar effect with images


Page 22: A Performance Portable Framework for Molecular Simulations


Multipole expansion

Expansion in Spherical Harmonics

$Y_0^0/r, \qquad Y_1^{\{-1,0,1\}}/r^2, \qquad Y_2^{\{-2,-1,0,1,2\}}/r^3, \;\dots$

https://upload.wikimedia.org/wikipedia/commons/6/62/Spherical_Harmonics.png


Page 23: A Performance Portable Framework for Molecular Simulations


Fast Multipole Method

Fast Multipole Method [Greengard and Rokhlin (1987)]

$\phi(r, \theta, \varphi) = \sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{+\ell} a_{\ell m} \frac{Y_{\ell,m}(\theta, \varphi)}{r^{\ell+1}} \qquad \text{for } r > R$

$\phi(r, \theta, \varphi) = \sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{+\ell} b_{\ell m}\, Y_{\ell,m}(\theta, \varphi)\, r^{\ell} \qquad \text{for } r \leq R$

Upward pass: multipole expansion on mesh hierarchy

Downward pass: increment local expansion

Evaluation of local expansion

⇒ Computational Complexity O(N)
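The outer (r > R) expansion can be checked numerically with scipy; the sketch below uses the standard convention in which the 4π/(2ℓ+1) factor is written explicitly rather than absorbed into a_{ℓm}, and it illustrates the expansion only, not the framework's FMM implementation.

import numpy as np
from scipy.special import sph_harm

def to_spherical(xyz):
    x, y, z = xyz[..., 0], xyz[..., 1], xyz[..., 2]
    r = np.sqrt(x * x + y * y + z * z)
    theta = np.arctan2(y, x) % (2.0 * np.pi)        # azimuthal angle
    phi = np.arccos(np.clip(z / r, -1.0, 1.0))      # polar angle
    return r, theta, phi

def multipole_coefficients(q, pos, lmax):
    # M_{lm} = sum_j q_j r_j^l conj(Y_l^m(theta_j, phi_j))
    r, theta, phi = to_spherical(pos)
    return {(l, m): np.sum(q * r**l * np.conj(sph_harm(m, l, theta, phi)))
            for l in range(lmax + 1) for m in range(-l, l + 1)}

def evaluate_expansion(coeffs, point, lmax):
    # phi(r) ~ sum_{l,m} 4*pi/(2l+1) * M_{lm} * Y_l^m(theta, phi) / r^(l+1), valid for r > R
    r, theta, phi = to_spherical(np.asarray(point, dtype=float))
    total = 0.0 + 0.0j
    for l in range(lmax + 1):
        for m in range(-l, l + 1):
            total += 4.0 * np.pi / (2 * l + 1) * coeffs[(l, m)] \
                     * sph_harm(m, l, theta, phi) / r**(l + 1)
    return total.real

# Charges inside a sphere of radius R < 1, evaluated well outside it.
rng = np.random.default_rng(0)
pos = rng.uniform(-0.5, 0.5, size=(50, 3))
q = rng.uniform(-1.0, 1.0, size=50)
point = np.array([5.0, 2.0, 1.0])
exact = np.sum(q / np.linalg.norm(point - pos, axis=1))
approx = evaluate_expansion(multipole_coefficients(q, pos, lmax=8), point, lmax=8)
print(exact, approx)   # should agree to several significant digits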


Page 24: A Performance Portable Framework for Molecular Simulations


Fast Multipole Method

[Diagram of the FMM passes on the mesh hierarchy: upward pass (multipole expansion on the finest level, then multipole→multipole shifts up the hierarchy); direct summation of (far) periodic images at the top; downward pass (local←local shifts, adding well-separated contributions at each level); direct summation of (near) periodic images/neighbours at the finest level.]


Page 25: A Performance Portable Framework for Molecular Simulations


Setup

DL_POLY NaCl benchmark

electrostatics + repulsive LJ interactions (to prevent collapse)

Charges arranged in a simple cubic lattice, a = 3.3 Å; random initial velocities

periodic boundary conditions

Constant density ρ = (3.3 Å)^{-3}

Running 200 iterations, Velocity Verlet (no thermostat)

particles wiggle around their initial positions

Accuracy: error on total potential energy δE/E ≤ 10^{-5}
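A minimal numpy sketch of how such an initial configuration could be generated (alternating ±1 charges on a simple cubic lattice with spacing 3.3 Å and random initial velocities); this is illustrative only, not the benchmark's actual setup code, and the velocity scale is a placeholder.

import numpy as np

a = 3.3                        # lattice spacing in Angstrom
n = 20                         # 20^3 = 8000 charges
ix, iy, iz = np.meshgrid(np.arange(n), np.arange(n), np.arange(n), indexing='ij')
positions = a * np.stack([ix, iy, iz], axis=-1).reshape(-1, 3)
charges = (-1.0) ** (ix + iy + iz).reshape(-1)      # alternating +1 / -1 (NaCl-like)
rng = np.random.default_rng(0)
velocities = rng.normal(scale=1.0, size=positions.shape)  # placeholder velocity scale
velocities -= velocities.mean(axis=0)               # remove centre-of-mass drift
extent = n * a                                      # edge length of the periodic box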


Page 26: A Performance Portable Framework for Molecular Simulations


Results

Weak scaling (1·10^6 → 16·10^6 particles, 16 → 256 cores)

[Plot: time per iteration in seconds vs. core count (16 to 256) at 62,500 particles per core, comparing PPMD on Ivy Bridge with ideal scaling.]

⇒ Confirms O(N) computational complexity

Page 27: A Performance Portable Framework for Molecular Simulations


Results

Strong scaling (62,500 → 3,906 particles/core, 16 → 256 cores)

[Plot: time per iteration in seconds vs. core count (16 to 256) for 10^6 particles, comparing PPMD (FMM), PPMD (FMM) MPI+OMP and DL_POLY (SPME) with ideal scaling.]


Page 28: A Performance Portable Framework for Molecular Simulations


Conclusion

Summary …
Saunders, Grant, Mueller (2018), Comp. Phys. Comms. vol 224, pp. 119-135 and arXiv:1708.01135

Performance portable (MPI, CUDA, CUDA+MPI, …)

Code generation framework based on “Separation of Concerns”

Speed/scalability comparable to established MD codes

Arbitrary pair kernels, not just force calculation

Long range electrostatic interactions (Ewald, Fast Multipole, …) currently CPU-only

… and Outlook
GPU offload of FMM (in progress)

Multiple species

Constraints (molecules)
