
Parallel Algorithms for Short-Range Molecular Dynamics

What is Molecular Dynamics?
Biophysics Application Project

Emiliano Ippoliti

September 14, 2011


Contents

1 Introduction
  1.1 The role of computer experiments
  1.2 What is molecular dynamics?
  1.3 Some historical notes
  1.4 Limitations
    1.4.1 Use of classical forces
    1.4.2 Realism of forces
    1.4.3 Time and size limitations

2 The basic machinery
  2.1 Modeling the physical system
    2.1.1 Lennard-Jones potential
    2.1.2 Potential truncation and long-range corrections
  2.2 Periodic boundary conditions
    2.2.1 The minimum image criterion
  2.3 Time integration algorithm
  2.4 The Verlet algorithm
    2.4.1 Velocity Verlet algorithm
    2.4.2 Leap-frog algorithm
    2.4.3 Error terms

3 Simple statistical quantities
  3.1 Potential energy
  3.2 Kinetic energy
  3.3 Total energy
  3.4 Temperature
  3.5 Radial distribution function
  3.6 MSD and VACF
  3.7 Pressure

4 Computational aspects of MD
  4.1 MD algorithms
  4.2 Long- vs short-range force model
  4.3 Neighbor lists and link-cell method
  4.4 Parallel algorithms

5 Computational aspects of our model
  5.1 Reduced units
  5.2 Simulation parameters

6 Atom-decomposition algorithm
  6.1 Expand and fold operations
  6.2 The simpler algorithm
  6.3 Using Newton's 3rd law
  6.4 Load-balance

7 Force-decomposition algorithm
  7.1 The simpler algorithm
  7.2 Using Newton's 3rd law
  7.3 Load-balance

8 Spatial-decomposition algorithm
  8.1 The communication scheme
  8.2 The simpler algorithm
    8.2.1 Communication scaling
    8.2.2 Computational scaling
    8.2.3 Data structures' update
  8.3 Using Newton's 3rd law
  8.4 Load-balance

References


1 Introduction

1.1 The role of computer experiments

Computer experiments play a very important role in science today. In the past, physical sciences were characterized by an interplay between experiment and theory. In experiment, a system is subjected to measurements, and results, expressed in numeric form, are obtained. In theory, a model of the system is constructed, usually in the form of a set of mathematical equations. The model is then validated by its ability to describe the system behavior in a few selected cases, simple enough to allow a solution to be computed from the equations. In many cases, this implies a considerable amount of simplification in order to eliminate all the complexities invariably associated with real world problems, and make the problem solvable.

In the past, theoretical models could be easily tested only in a few simple special circumstances. So, for instance, in condensed matter physics a model for intermolecular forces in a specific material could be verified in a diatomic molecule, or in a perfect, infinite crystal. Even then, approximations were often required to carry out the calculation. Unfortunately, many physical problems of extreme interest (both academic and practical) fall outside the realm of these "special circumstances." Among them, one could mention the physics and chemistry of organic molecules, involving a large number of degrees of freedom.

The advent of high speed computers, which started to be used in the 1950s, altered the picture by inserting a new element right in between experiment and theory: the computer experiment. In a computer experiment, a model is still provided by theorists, but the calculations are carried out by the machine by following a "recipe" (the algorithm, implemented in a suitable programming language). In this way, complexity can be introduced (still with caution!) and more realistic systems can be investigated, opening a road towards a better understanding of real experiments.

Needless to say, the development of computer experiments altered substantially the traditional relationship between theory and experiment. On one side, computer simulations increased the demand for accuracy of the models. For instance, a molecular dynamics simulation allows us to evaluate the melting temperature of a material, modeled by means of a certain interaction law. This is a difficult test for the theoretical model to pass, and a test that has not been available in the past. Therefore, simulation "brings to life" the models, disclosing critical areas and providing suggestions to improve them.

On the other side, simulation can often come very close to experimental conditions, to the extent that computer results can sometimes be compared directly with experimental results. When this happens, simulation becomes an extremely powerful tool not only to understand and interpret the experiments at the microscopic level, but also to study regions which are not accessible experimentally, or which would imply very expensive experiments, such as under extremely high pressure.

Last but not least, computer simulations allow thought experiments to be realized: things which are just impossible to do in reality, but whose outcome greatly increases our understanding of phenomena. Fantasy and creativity are important qualities for the computer simulator!


1.2 What is molecular dynamics?

We call molecular dynamics (MD) a computer simulation technique where the time evolution of a set of interacting atoms is followed by integrating their equations of motion. In molecular dynamics we follow the laws of classical mechanics, most notably Newton's 2nd law:

F_i = m_i a_i    (1)

for each atom i in a system constituted by N atoms. Here, m_i is the atom mass, a_i = d^2 r_i / dt^2 its acceleration, and F_i the force acting upon it, due to the interactions with other atoms. Therefore, molecular dynamics is a deterministic technique: given an initial set of positions and velocities, the subsequent time evolution is in principle¹ completely determined. In more pictorial terms, atoms will "move" in the computer, bumping into each other, wandering around (if the system is fluid), oscillating in waves in concert with their neighbors, and so on, in a way pretty similar to what atoms in a real substance would do.

The computer calculates a trajectory in a 6N-dimensional phase space (3N positions and 3N momenta). However, such a trajectory is usually not particularly relevant by itself. Molecular dynamics is a statistical mechanics method: it is a way to obtain a set of configurations distributed according to some statistical distribution function or statistical ensemble. According to statistical physics, physical quantities are represented by averages over configurations distributed according to a certain statistical ensemble. A trajectory obtained by molecular dynamics provides such a set of configurations. Therefore, a measurement of a physical quantity by simulation is simply obtained as an arithmetic average of the various instantaneous values assumed by that quantity during the MD run.

Statistical physics is the link between the microscopic behavior and thermodynamics. In the limit of very long simulation times, one could expect the phase space to be fully sampled, and in that limit this averaging process would yield the thermodynamic properties. In practice, the runs are always of finite length and one should exert caution to estimate when the sampling may be good ("system at equilibrium") or not. In this way, MD simulations can be used to measure thermodynamic properties.

1.3 Some historical notes

Molecular dynamics is concerned with simulating the motion of molecules to gain a deeper understanding of several physical phenomena that derive from molecular interactions. These studies include not only the motion of many molecules as in a fluid, but also the motion of a single large molecule consisting of hundreds or thousands of atoms, as in a protein.

Computers are a critically important tool for these studies because there simply is no other way to trace the motion of a large number of interacting particles. The earliest of these computations were done in the 1950s by Berni Alder and Tom Wainwright at Lawrence Livermore National Laboratory [1]. They studied the distribution of molecules in a liquid, using a model in which the molecules are represented as "hard spheres" that interact like billiard balls. Using the fastest computer at that time, an IBM 704, they were able to simulate the motions of 32 and 108 molecules in computations requiring 10 to 30 hours. Now it is possible to perform hard sphere computations on systems of over a billion particles.

¹ In practice, the finiteness of the integration time step and arithmetic rounding errors will eventually cause the computed trajectory to deviate from the true trajectory.

Another crucial paper appeared in 1967 [2]. Loup Verlet introduced the concept of the neighbor list and the Verlet time integration algorithm, which we will meet in the next lessons, to calculate the phase diagram of argon and to compute correlation functions to test theories of the liquid state. Verlet studied a molecular model more realistic than the hard sphere one, known as the Lennard-Jones model (the same model we will study in this course).

The Lennard-Jones model was also employed by Lawrence Hannon, George Lie, and Enrico Clementi in 1986 at IBM Kingston to study the flow of fluids [3]. In these computations the fluid was represented by ∼ 10⁴ interacting molecules. Even though this is minuscule compared with the number of molecules in a gram of water, the behavior of the flow was like that in a real fluid.

Another class of molecular dynamics computations is concerned with the internal motion of molecules, especially proteins and nucleic acids, vital components of biological systems. The goal in this case is to gain a better understanding of the function of these molecules in biochemical reactions. Interestingly, it has been found that quantum effects do not have much influence on the overall dynamics of proteins except at low temperatures. Thus, classical mechanics is sufficient to model the motions, but still the computational power required for following the motion of a large molecule is enormous. For example, simulating the motion of a 1,500-atom molecule, a small protein, for a time interval of 10⁻⁷ seconds is a 6-hour computation on a modern Intel Xeon quad-core processor.

Martin Karplus and Andrew McCammon, in an interesting article, "The Molecular Dynamics of Proteins" [4], describe a discovery concerning the molecule myoglobin that could only have been made through molecular dynamics. The interest in myoglobin comes about because it stores oxygen in biological systems. Whales, for example, are able to stay under water for long periods of time because of a large amount of myoglobin in their bodies. It was known that the oxygen molecule binds to a particular site in the myoglobin molecule, but it was not understood how the binding could take place. X-ray crystallography work shows that large protein molecules tend to be folded up into compact three-dimensional structures, and in the case of myoglobin the oxygen sites were known to lie in the interior of such a structure. There did not seem to be any way that an oxygen atom could penetrate the structure to reach the binding site. Molecular dynamics provided the answer to this puzzle. The picture of a molecule provided by X-ray crystallography shows the average positions of the atoms in the molecule. In fact these atoms are in continuous motion, vibrating about positions of equilibrium. Molecular dynamics simulation of the internal motion of the molecule showed that a path to the site, wide enough for an oxygen atom, could open up for a short period of time.

Scientific visualization is particularly important for understanding the results of a molecular dynamics simulation. The millions of numbers representing a history of the positions and velocities of the particles are not a very revealing picture of the motion. How can one recognize the formation of a vortex in this mass of data, or the nature of the bending and stretching of a large molecule? How can one gain new insights? Pictures and animations enable the scientist to literally see the formation of vortices, to view protein bending, and thus to gain new insights into the details of these phenomena.

Computations that involve following the motion of a large number of interacting particles, whether the particles are atoms in a molecule, molecules of a fluid or solid, or even particles in the discrete model of a vibrating string or membrane, are similar in the following respect. They involve a long series of time steps, at each of which Newton's laws are used to determine the new positions and velocities from the old positions and velocities and the forces. The computations are quite simple but there are many of them. To achieve accuracy, the time steps must be quite small, and therefore many are required to simulate a significantly long real time interval. In computational chemistry the time step is typically about a femtosecond (10⁻¹⁵ seconds), and the total computation may represent ∼ 10⁻⁸ seconds, which could cost about 100 hours of computer time. The amount of computation at each step can be quite extensive. In a system consisting of N particles the force computations at each step may involve O(N²) operations. Thus it is easy to see that these computations can consume a large number of machine cycles. In addition, animation of the motion of the system can make large demands on memory.

1.4 Limitations

Molecular dynamics is a very powerful technique but has, of course, limitations. We quickly examine the most important of them.

1.4.1 Use of classical forces

One could immediately ask: how can we use Newton's law to move atoms, when everybody knows that systems at the atomistic level obey quantum laws rather than classical laws, and that Schrödinger's equation is the one to be followed?

A simple test of the validity of the classical approximation is based on the de Broglie thermal wavelength [5], defined as

\Lambda = \sqrt{\frac{2\pi\hbar^2}{M k_B T}}    (2)

where M is the atomic mass and T the temperature. The classical approximation is justified if Λ ≪ a, where a is the mean nearest neighbor separation. If one considers, for instance, liquids at the triple point, Λ/a is of the order of 0.1 for light elements such as Li and Ar, decreasing further for heavier elements. The classical approximation is poor for very light systems such as H₂, He, Ne.
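As a quick numerical illustration of this criterion, one can evaluate Λ/a for liquid argon near its triple point. The sketch below is only indicative: the temperature (84 K) and the nearest-neighbor separation (3.7 Å) are assumed representative values, not quantities fixed by these notes.

```python
import math

# Physical constants (SI units)
HBAR = 1.054571817e-34    # J s
KB   = 1.380649e-23       # J / K
AMU  = 1.66053906660e-27  # kg

def thermal_wavelength(mass_kg, T):
    """de Broglie thermal wavelength, eq. (2)."""
    return math.sqrt(2.0 * math.pi * HBAR**2 / (mass_kg * KB * T))

# Assumed values: argon near its triple point, mean nearest-neighbor
# separation a ~ 3.7 Angstrom.
M_Ar = 39.948 * AMU
T    = 84.0
a    = 3.7e-10

lam = thermal_wavelength(M_Ar, T)
print(f"Lambda = {lam*1e10:.2f} A, Lambda/a = {lam/a:.3f}")  # ~0.30 A, ~0.08: classical OK
```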

Moreover, quantum effects become important in any system when T is sufficiently low. The drop in the specific heat of crystals below the Debye temperature [6] or the anomalous behavior of the thermal expansion coefficient are well-known examples of measurable quantum effects in solids.

Molecular dynamics results should be interpreted with caution in these regions.

1.4.2 Realism of forces

How realistic is a molecular dynamics simulation?


In molecular dynamics, atoms interact with each other. These interactions originate forces that act upon atoms, and atoms move under the action of these instantaneous forces. As the atoms move, their relative positions change and forces change as well.

The essential ingredient containing the physics is therefore constituted by the forces. A simulation is realistic, i.e., it mimics the behavior of the real system, only to the extent that interatomic forces are similar to those that real atoms (or, more exactly, nuclei) would experience when arranged in the same configuration.

As we will see in the next section, forces are usually obtained as the (negative) gradient of a potential energy function depending on the positions of the particles. The realism of the simulation therefore depends on the ability of the potential chosen to reproduce the behavior of the material under the conditions at which the simulation is run.

The problem of selecting, or constructing, potentials (the so-called force-field problem) is still under study by the scientific community and it is beyond the scope of these short notes.

1.4.3 Time and size limitations

Typical MD simulations can nowadays be performed on systems containing hundreds of thousands, or perhaps millions, of atoms, and for simulation times ranging from a few nanoseconds to more than one microsecond. While these numbers are certainly respectable, one may run into conditions where time and/or size limitations become important.

A simulation is "safe" from the point of view of its duration when the simulation time is much longer than the relaxation time of the quantities we are interested in. However, different properties have different relaxation times. In particular, systems tend to become slow and sluggish in the proximity of phase transitions, and it is not uncommon to find cases where the relaxation time of a physical property is orders of magnitude larger than times achievable by simulation.

A limited system size can also constitute a problem. In this case one has to compare the size of the MD cell with the correlation lengths of the spatial correlation functions of interest. Again, correlation lengths may increase or even diverge in the proximity of phase transitions, and the results are no longer reliable when they become comparable with the box length.

This problem can be partially alleviated by a method known as finite size scaling. This consists of computing a physical property A using several boxes with different sizes L, and then fitting the results to a relation

A(L) = A_0 + \frac{c}{L^n}

using A_0, c, n as fitting parameters. A_0 then corresponds to \lim_{L\to\infty} A(L), and should therefore be taken as the most reliable estimate for the "true" physical quantity.
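In practice such a fit can be done with any nonlinear least-squares routine. The minimal sketch below uses scipy.optimize.curve_fit; the A(L) data points are invented purely for illustration and do not come from these notes.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(L, A0, c, n):
    """Finite-size scaling ansatz A(L) = A0 + c / L**n."""
    return A0 + c / L**n

# Hypothetical measurements of a property A from boxes of different size L
L_values = np.array([6.0, 8.0, 10.0, 14.0, 20.0])
A_values = np.array([1.261, 1.253, 1.248, 1.242, 1.238])

popt, pcov = curve_fit(scaling_law, L_values, A_values, p0=(1.23, 1.0, 1.0))
A0, c, n = popt
print(f"A0 = {A0:.4f}")   # extrapolated L -> infinity estimate of the property
```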


2 The basic machinery

2.1 Modeling the physical system

The main ingredient of a simulation is a model for the physical system. For a molecular dynamics simulation this amounts to choosing the potential, a function U(r_1, ..., r_N) of the positions of the nuclei, representing the potential energy of the system when the atoms are arranged in that specific configuration. This function is translationally and rotationally invariant, and is usually constructed from the relative positions of the atoms with respect to each other, rather than from the absolute positions.

Forces are then derived as minus the gradients of the potential with respect to the atomic positions:

F_i = -\nabla_{r_i} U(r_1, \ldots, r_N)    (3)

This form implies the presence of a conservation law for the total energy E = K + U, where K is the instantaneous kinetic energy:

K = \sum_i \frac{1}{2} m_i v_i^2    (4)

where v_i is the velocity of particle i.

The simplest choice for U is to write it as a sum of pairwise interactions:

U(r_1, \ldots, r_N) = \sum_i \sum_{j>i} \phi(|r_i - r_j|)    (5)

The clause j > i in the second summation has the purpose of considering each atom pair only once. In the past most potentials were constituted by pairwise interactions, but this is no longer the case. It has been recognized that the two-body approximation is very poor for many relevant systems, such as metals and semiconductors. Various kinds of many-body potentials are now of common use in condensed matter simulation. The development of accurate potentials represents an important research line.

In this course, however, we will take into account only the most commonly used pairwise interaction model: the Lennard-Jones pair potential, which has been shown to describe correctly monoatomic liquid systems such as liquid argon, as we will see in the next section.

2.1.1 Lennard-Jones potential

The Lennard-Jones 12-6 potential (LJ) is given by the expression

\phi_{LJ}(r) = 4\varepsilon \left[ \left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6} \right]    (6)

for the interaction potential between a pair of atoms. The total potential of a system containing many atoms is then given by (5). This potential has an attractive tail at large r (i.e., a negative value, since we put the arbitrary zero of the energy at the value of the potential when the two particles are at an infinite distance apart), reaches a minimum around 1.122σ, and is strongly repulsive at shorter distances, passing through 0 at r = σ and increasing steeply as r is decreased further (Fig. 1).

Figure 1: The Lennard-Jones 12-6 potential given by (6).

The term ∼ 1/r¹², dominating at short distance, models the repulsion between atoms when they are brought very close to each other. Its physical origin is related to the Pauli principle: when the electronic clouds surrounding the atoms start to overlap, the energy of the system increases abruptly. The exponent 12 was chosen exclusively on a practical basis: equation (6) is particularly easy to compute. In fact, on physical grounds an exponential behavior would be more appropriate.

The term ∼ 1/r⁶, dominating at large distance, constitutes the attractive part. This is the term that gives cohesion to the system. A 1/r⁶ attraction originates from van der Waals dispersion forces, caused by dipole-dipole interactions due in turn to fluctuating dipoles. These are rather weak interactions, which however dominate the bonding character of closed-shell systems, that is, rare gases such as Ar or Kr. Therefore, these are the materials that a LJ potential can mimic fairly well². The parameters ε and σ are chosen to fit the physical properties of the material.
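A minimal sketch of eq. (6) in the reduced units adopted later (σ = ε = 1); the function name lj_potential is just an illustrative choice.

```python
def lj_potential(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones 12-6 pair potential, eq. (6)."""
    sr6 = (sigma / r)**6
    return 4.0 * epsilon * (sr6**2 - sr6)

r_min = 2.0**(1.0 / 6.0)      # position of the minimum, ~1.122 sigma
print(lj_potential(1.0))      # 0.0 at r = sigma
print(lj_potential(r_min))    # -epsilon at the minimum
```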

On the other hand, a LJ potential is not at all adequate to model situations with open shells, where strong localized bonds may form (as in covalent systems), or where there is a delocalized "electron sea" in which the ions sit (as in metals). In these systems the two-body interaction scheme itself fails very badly. We will not deal with these cases further.

² Many-body effects in the interaction are however present also in these systems and, in all cases, pair potentials more accurate than LJ have been developed for rare gases.

However, regardless of how well it is able to model actual physical systems, the LJ 12-6 potential constitutes nowadays an extremely important model system. There is a vast body of papers that investigated the behavior of atoms interacting via LJ in a variety of different geometries (solids, liquids, surfaces, clusters, two-dimensional systems, etc.). One could say that LJ is the standard potential to use for all the investigations where the focus is on fundamental issues, rather than on studying the properties of a specific material. The simulation work done on LJ systems helped us (and still does) to understand basic points in many areas of condensed matter physics, and consequently of biophysics, and for this reason the importance of LJ cannot be overestimated.

When using the LJ potential in simulation, it is customary to work in a system of units where σ = 1 and ε = 1. All the codes in this course follow this convention.

2.1.2 Potential truncation and long-range corrections

The potential in eq. (6) has an infinite range. In practical applications, it is customary to establish a cutoff radius Rc and disregard the interactions between atoms separated by more than Rc. This results in simpler programs and enormous savings of computer resources, because the number of atomic pairs separated by a distance r grows as r² and quickly becomes huge.

A simple truncation of the potential creates a new problem though: whenever a particle pair "crosses" the cutoff distance, the energy makes a little jump. A large number of these events is likely to spoil energy conservation in a simulation. To avoid this problem, the potential is often shifted in order to vanish at the cutoff radius³:

\phi(r) = \begin{cases} \phi_{LJ}(r) - \phi_{LJ}(R_c) & \text{if } r \le R_c \\ 0 & \text{if } r > R_c \end{cases}    (7)
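A sketch of the shifted form (7), reusing the lj_potential function from the previous sketch; with the cutoff Rc = 2.5σ adopted later in these notes, the shift φ_LJ(Rc) amounts to about −0.0163ε.

```python
def lj_truncated_shifted(r, rc=2.5, epsilon=1.0, sigma=1.0):
    """Truncated and shifted LJ potential, eq. (7): zero at and beyond the cutoff."""
    if r > rc:
        return 0.0
    return lj_potential(r, epsilon, sigma) - lj_potential(rc, epsilon, sigma)
```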

Physical quantities are of course affected by the potential truncation. The effects of truncating a full-ranged potential can be approximately estimated by treating the system as a uniform (constant density) continuum beyond Rc. For a bulk system (periodic boundary conditions along each direction, see next section) this usually amounts to a constant additive correction. For example, the potential tail (attractive) brings a small additional contribution to the cohesive energy, and to the total pressure.

Commonly used truncation radii for the LJ potential are 2.5σ and 3.2σ. It should be mentioned that these truncated Lennard-Jones models are so popular that they acquired a value on their own as reference models for generic two-body systems (as also discussed at the end of the previous section). In many cases, there is no interest in evaluating truncation corrections because the truncated model itself is the subject of the investigation. This is not our case, and we will use long-range correction terms in our code. Moreover, we choose to adopt the smaller commonly used truncation radius (2.5σ).

³ Shifting eliminates the energy discontinuity, but not the force discontinuity. Some researchers altered the form of the potential near Rc to obtain a smoother junction, but there is no standard way to do that.


2.2 Periodic boundary conditions

What should we do at the boundaries of our simulated system? One possibility is doing nothing special: the system simply terminates, and atoms near the boundary would have fewer neighbors than atoms inside. In other words, the sample would be surrounded by surfaces.

Unless we really want to simulate a cluster of atoms, this situation is not realistic. No matter how large the simulated system is, its number of atoms N would be negligible compared with the number of atoms contained in a macroscopic piece of matter (of the order of 10²³), and the ratio between the number of surface atoms and the total number of atoms would be much larger than in reality, causing surface effects to be much more important than they should be.

A solution to this problem is to use periodic boundary conditions (PBC). When using PBC, particles are enclosed in a box, and we can imagine that this box is replicated to infinity by rigid translation in all the three Cartesian directions, completely filling the space. In other words, if one of our particles is located at position r in the box, we assume that this particle really represents an infinite set of particles located at

r + la + mb + nc,    (l, m, n = -\infty, \ldots, \infty),

where l, m, n are integer numbers, and we have indicated with a, b, c the vectors corresponding to the edges of the box. All these "image" particles move together, and in fact only one of them is represented in the computer program.

The key point is that now each particle i in the box should be thought of as interacting not only with the other particles j in the box, but also with their images in nearby boxes. That is, interactions can "go through" box boundaries. In fact, one can easily see that (a) we have virtually eliminated surface effects from our system, and (b) the position of the box boundaries has no effect (i.e., a translation of the box with respect to the particles leaves the forces unchanged).

Apparently, the number of interacting pairs increases enormously as an effect of PBC. In practice, this is not true because potentials, like the LJ one, usually have a short interaction range. The minimum image criterion discussed next simplifies things further, reducing to a minimum the level of additional complexity introduced in a program by the use of PBC.

Finally, it is worth pointing out that even if PBC allow us to describe an infinite system of particles that approaches a real system more and more closely, the PBC system is still not an exact representation of reality: in fact, PBC introduce an artificial symmetry, the translational invariance, which is not present in the real system.

2.2.1 The minimum image criterion

Let us suppose that we are using a potential with a finite range, like the LJ one: when separated by a distance equal to or larger than a cutoff distance Rc, two particles do not interact with each other. Let us also suppose that we are using a box whose size is larger than 2Rc along each Cartesian direction.

When these conditions are satisfied, it is obvious that at most one among all the pairs formed by a particle i in the box and the set of all the periodic images of another particle j will interact.


To demonstrate this, let us suppose that i interacts with two images j₁ and j₂ of j. The two images must be separated by the translation vector bringing one box into another, whose length is larger than 2Rc by hypothesis. In order to interact with both j₁ and j₂, i should be within a distance Rc from each of them, which is impossible since they are separated by more than 2Rc.

When we are in these conditions, we can safely use the minimum image criterion: among all possible images of a particle j, select the closest, and throw away all the others. In fact, only the closest is a candidate to interact: all the others certainly do not.

These operating conditions greatly simplify the setup of an MD program, and are commonly used. Of course, one must always make sure that the box size is at least 2Rc along all the directions where PBCs are in effect.
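A minimal sketch of the minimum image convention for an orthorhombic box with PBC along every direction; the box lengths and coordinates are illustrative only, and in an actual run each box length must be at least 2Rc.

```python
import numpy as np

def minimum_image(ri, rj, box):
    """Displacement ri - rj folded into [-box/2, box/2) along each axis
    (orthorhombic box, periodic along every direction)."""
    d = ri - rj
    return d - box * np.round(d / box)

box = np.array([10.0, 10.0, 10.0])
ri  = np.array([0.5, 9.8, 5.0])
rj  = np.array([9.7, 0.3, 5.0])
print(minimum_image(ri, rj, box))   # [ 0.8 -0.5  0. ]
```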

2.3 Time integration algorithm

The engine of a molecular dynamics program is its time integration algorithm, required to integrate the equations of motion of the interacting particles and follow their trajectory. Time integration algorithms are based on finite difference methods, where time is discretized on a finite grid, the time step ∆t being the distance between consecutive points on the grid. Knowing the positions and some of their time derivatives at time t (the exact details depend on the type of algorithm), the integration scheme gives the same quantities at a later time t + ∆t. By iterating the procedure, the time evolution of the system can be followed for long times.

Of course, these schemes are approximate and there are errors associated with them. In particular, one can distinguish between

• Truncation errors, related to the accuracy of the finite difference method with respect to the true solution. Finite difference methods are usually based on a Taylor expansion truncated at some term, hence the name. These errors do not depend on the implementation: they are intrinsic to the algorithm.

• Round-off errors, related to a particular implementation of the algorithm, for instance to the finite number of digits used in computer arithmetic.

Both errors can be reduced by decreasing ∆t. For large ∆t truncation errors dominate, but they decrease quickly as ∆t is decreased. For instance, the Verlet algorithm discussed in the next section has a truncation error proportional to ∆t⁴ for each integration time step. Round-off errors decrease more slowly with decreasing ∆t, and dominate in the small ∆t limit. Using 64-bit precision (corresponding to "double precision" on the majority of today's workstations) helps to keep round-off errors at a minimum.

The popular Verlet integration method, which will be adopted in the codes of this course, is presented in the section below. For more detailed information on time integration algorithms, the interested reader is referred to ref. [7] for a general survey, and to ref. [8] for a deeper analysis.


2.4 The Verlet algorithm

In molecular dynamics, the most commonly used time integration algorithm is probably the so-called Verlet algorithm [2]. The basic idea is to write two third-order Taylor expansions for the positions r(t), one forward and one backward in time. Calling v the velocities, a the accelerations, and b the third derivatives of r with respect to t, one has:

r(t+\Delta t) = r(t) + v(t)\Delta t + (1/2)a(t)\Delta t^2 + (1/6)b(t)\Delta t^3 + O(\Delta t^4)    (8)
r(t-\Delta t) = r(t) - v(t)\Delta t + (1/2)a(t)\Delta t^2 - (1/6)b(t)\Delta t^3 + O(\Delta t^4)    (9)

Adding the two expressions gives

r(t+\Delta t) = 2r(t) - r(t-\Delta t) + a(t)\Delta t^2 + O(\Delta t^4)    (10)

This is the basic form of the Verlet algorithm. Since we are integrating Newton's equations, a(t) is just the force divided by the mass, and the force is in turn a function of the positions r(t):

a(t) = −(1/m)∇U(r(t)) (11)

As one can immediately see, the truncation error of the algorithm when evolving the system by ∆t is of the order of ∆t⁴, even if third derivatives do not appear explicitly. This algorithm is at the same time simple to implement, accurate and stable, explaining its large popularity among molecular dynamics simulators.

2.4.1 Velocity Verlet algorithm

A problem with this version of the Verlet algorithm is that velocities are not directly generated. While they are not needed for the time evolution, their knowledge is sometimes necessary. Moreover, they are required to compute the kinetic energy K, whose evaluation is necessary to test the conservation of the total energy E = K + U.⁴ This is one of the most important tests to verify that an MD simulation is proceeding correctly. One could compute the velocities from the positions by using

v(t) = \frac{r(t+\Delta t) - r(t-\Delta t)}{2\Delta t}    (12)

However, the error associated with this expression is of order ∆t² rather than ∆t⁴. In fact, from (8) and (9) we have:

v(t) = \frac{r(t+\Delta t) - r(t-\Delta t)}{2\Delta t} - \frac{1}{6}b(t)\Delta t^2 + O(\Delta t^3)    (13)

⁴ One of the most important features that makes the Verlet algorithm so popular is the fact that the conservation of energy is very well respected, even at large time steps.

To overcome this difficulty, some variants of the Verlet algorithm have been developed. They give rise to exactly the same trajectory, and differ in what variables are stored in memory and at what times. For example, an even better implementation of the same basic algorithm is the so-called velocity Verlet scheme, where positions, velocities and accelerations at time t + ∆t are obtained from the same quantities at time t in the following way:

r(t+\Delta t) = r(t) + v(t)\Delta t + \frac{1}{2}a(t)\Delta t^2

v(t+\frac{\Delta t}{2}) = v(t) + \frac{1}{2}a(t)\Delta t

a(t+\Delta t) = -(1/m)\nabla U(r(t+\Delta t))

v(t+\Delta t) = v(t+\frac{\Delta t}{2}) + \frac{1}{2}a(t+\Delta t)\Delta t

Note how we need 9N memory locations to save the N positions, velocities and accelerations, but we never need to have simultaneously stored the values at two different times for any one of these quantities.
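A minimal sketch of one velocity Verlet step for arrays of shape (N, 3). The routine compute_acceleration stands for any user-supplied function returning −∇U/m for a given set of positions; it is an assumed name, not a function defined in these notes.

```python
def velocity_verlet_step(r, v, a, dt, compute_acceleration):
    """Advance positions, velocities and accelerations (numpy arrays of shape
    (N, 3)) by one time step dt, following the four-line scheme above."""
    r_new = r + v * dt + 0.5 * a * dt**2     # r(t + dt)
    v_half = v + 0.5 * a * dt                # v(t + dt/2)
    a_new = compute_acceleration(r_new)      # a(t + dt) = -grad U / m
    v_new = v_half + 0.5 * a_new * dt        # v(t + dt)
    return r_new, v_new, a_new
```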

2.4.2 Leap-frog algorithm

In the codes developed in this course another variant of the Verlet algorithm will be employed, which is very similar to the previous one: the so-called leap-frog algorithm. Leap-frog integration is equivalent to calculating positions and velocities at interleaved time points, interleaved in such a way that they "leapfrog" over each other. For example, the position is known at integer time steps and the velocity is known at integer-plus-half time steps. The equations for the leap-frog algorithm can be written

r(t+\Delta t) = r(t) + v(t+\frac{\Delta t}{2})\Delta t    (14)

v(t+\frac{\Delta t}{2}) = v(t-\frac{\Delta t}{2}) + a(t)\Delta t    (15)

Note that at the start (t = 0) of the leap-frog iteration, the first step (t = ∆t/2) computes v(t + ∆t/2) and therefore needs the velocity at t = −∆t/2. At first sight this could give problems, because the initial conditions are known only at the initial time (t = 0). Usually, the initial velocity is used in place of the v(−∆t/2) term: the error on the first time step calculation is then of order O(∆t²). This is not considered a problem because, in a simulation over a large number of time steps, the error on the first time step is only a negligibly small part of the total error, which at time t = n∆t is of the order O(e^{Ln∆t}∆t²).
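A minimal sketch of one leap-frog step, eqs. (14) and (15), with the same assumed compute_acceleration routine as in the previous sketch; velocities live at half-integer times.

```python
def leapfrog_step(r, v_half, dt, compute_acceleration):
    """One leap-frog step: r stored at integer times, v_half at half-integer times."""
    a = compute_acceleration(r)       # a(t) from r(t)
    v_half_new = v_half + a * dt      # v(t + dt/2) = v(t - dt/2) + a(t) dt,  eq. (15)
    r_new = r + v_half_new * dt       # r(t + dt)   = r(t) + v(t + dt/2) dt,  eq. (14)
    return r_new, v_half_new

# Bootstrapping, as discussed above: at t = 0 one may simply pass v(0) in place
# of v(-dt/2), accepting an O(dt^2) error on the first step.
```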

2.4.3 Error terms

The local error in position of the Verlet integrator is O(∆t⁴) as described above, and the local error in velocity is O(∆t²).

The global error in position, in contrast, is O(∆t²), and the global error in velocity is also O(∆t²). These can be derived by noting the following:

error(r(t_0 + \Delta t)) = O(\Delta t^4)


and from (10)

r(t_0 + 2\Delta t) = 2r(t_0 + \Delta t) - r(t_0) + a(t_0 + \Delta t)\Delta t^2 + O(\Delta t^4).

Therefore

error(r(t_0 + 2\Delta t)) = 2\, error(r(t_0 + \Delta t)) + O(\Delta t^4) = 3\, O(\Delta t^4).

In deriving the above equation, note that the errors on r(t_0 + ∆t) and r(t_0) must not simply be summed in magnitude, as one would do with "random" errors; the minus sign has to be kept, since the error of the first term depends on the error of the second one through a precise function, and consequently they always go in "the same direction". Similarly to the previous calculation, we have:

error(r(t_0 + 3\Delta t)) = 6\, O(\Delta t^4)
error(r(t_0 + 4\Delta t)) = 10\, O(\Delta t^4)
error(r(t_0 + 5\Delta t)) = 15\, O(\Delta t^4),

which can be generalized by induction to

error(r(t_0 + n\Delta t)) = \frac{n(n+1)}{2}\, O(\Delta t^4).

If we consider the global error in position between r(t) and r(t + T), where T = n∆t, it is clear that

error(r(t_0 + T)) = \left( \frac{T^2}{2\Delta t^2} + \frac{T}{2\Delta t} \right) O(\Delta t^4),

and therefore, the global (cumulative) error over a constant interval of time is given by

error(r(t_0 + T)) = O(\Delta t^2).

Because the velocity is determined in a non-cumulative way⁵ from the positions in the Verlet integrator, the global error in velocity is also O(∆t²).

In molecular dynamics simulations, the global error is typically far more important than the local error, and the Verlet integrator is therefore known as a second-order integrator.

⁵ In the Verlet algorithm the velocity plays no part in the integration of the equations of motion. However, velocities are often necessary for the calculation of certain physical quantities like the kinetic energy. This deficiency can be dealt with either by using the velocity Verlet (or leap-frog) algorithm, or by estimating the velocity from the positions through equation (12), which has a local error O(∆t²). Since from this equation the velocity depends only on the positions, and not on the velocities evaluated at previous times, the expression "non-cumulative way" is used for the determination of the velocity in the Verlet algorithm.


3 Simple statistical quantities

Measuring quantities in molecular dynamics usually means performing time averages of physical properties over the system trajectory. Physical properties are usually a function of the particle coordinates and velocities. So, for instance, one can define the instantaneous value of a generic physical property A at time t:

A(t) = f(r_1(t), \ldots, r_N(t), v_1(t), \ldots, v_N(t))

and then obtain its average

\langle A \rangle = \frac{1}{N_T} \sum_{t=1}^{N_T} A(t)

where t is an index which runs over the time steps from 1 to the total number of steps N_T.

There are two equivalent ways to do this in practice:

1. A(t) is calculated at each time step by the MD program while running. The sum \sum_t A(t) is also updated at each step. At the end of the run the average is immediately obtained by dividing by the number of steps. This is the preferred method when the quantity is simple to compute and/or particularly important. An example is the system temperature.

2. Positions (and possibly velocities) are periodically dumped in a "trajectory file" while the program is running. A separate program, running after the simulation program, processes the trajectory and computes the desired quantities. This approach can be very demanding in terms of disk space: dumping positions and velocities using 64-bit precision takes 48 bytes per step and per particle. However, it is often used when the quantities to compute are complex, or combine different times as in dynamical correlations, or when they depend on other additional parameters that may be unknown when the MD program is run. In these cases, the simulation is run once, but the resulting trajectory can be processed over and over.

Our codes will give examples of both strategies.

3.1 Potential energy

The average potential energy U is obtained by averaging its instantaneous value, which is usually obtained straightforwardly at the same time as the force computation is made. For instance, in the case of two-body interactions

U(t) = \sum_i \sum_{j>i} \phi(|r_i(t) - r_j(t)|)    (16)

Even if not strictly necessary to perform the time integration (forces are all that is needed), knowledge of U is required to verify energy conservation. This is an important check to do in any MD simulation.


We mentioned in the previous chapter that when one deals with short-range interactions, the potential and force are usually truncated at some cutoff distance Rc for computational expediency. Consequently, the effective potential in our case will be:

\phi(r_{ij}) = \begin{cases} \phi_{LJ}(r_{ij}) & \text{if } r_{ij} \le R_c \\ 0 & \text{if } r_{ij} > R_c \end{cases}    (17)

Such truncation implies that certain physical quantities, like the potential energy and the pressure, have to be corrected in order to obtain a value closer to the one that would be obtained with the exact potential. An approximate correction can be obtained by assuming that the spatial correlations beyond the cutoff distance are unity. The student is referred to Refs. [7] and [9] for these so-called "standard long-range corrections" (sLRC). It should be noted that this is not a good assumption in inhomogeneous fluids. We will use the following expression for the long-range correction to be applied to each term of the outermost sum in (16):

\phi_{LRC} = \frac{1}{2}\, 4\pi\rho \int_{R_c}^{\infty} dr_{ij}\, r_{ij}^2\, \phi_{LJ}(r_{ij}) = 8\pi\rho\varepsilon\sigma^3 \left[ \frac{1}{9}\left(\frac{\sigma}{R_c}\right)^9 - \frac{1}{3}\left(\frac{\sigma}{R_c}\right)^3 \right]    (18)

where ρ is the bulk number density (N/V ).
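A minimal sketch of eq. (18) in reduced units; the function name and the example state point (ρ = 0.8442, Rc = 2.5, a commonly used LJ liquid point) are illustrative choices, not values prescribed by these notes.

```python
import math

def lj_energy_tail(rho, rc, epsilon=1.0, sigma=1.0):
    """Per-particle long-range (tail) correction to the potential energy, eq. (18)."""
    sr3 = (sigma / rc)**3
    sr9 = sr3**3
    return 8.0 * math.pi * rho * epsilon * sigma**3 * (sr9 / 9.0 - sr3 / 3.0)

print(lj_energy_tail(0.8442, 2.5))   # ~ -0.452 per particle in reduced units
```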

3.2 Kinetic energy

The instantaneous kinetic energy is of course given by

K(t) = \frac{1}{2} \sum_i m_i [v_i(t)]^2    (19)

and is therefore extremely easy to compute. We will call K its average over the run.

3.3 Total energy

The total energy E = K + U is a conserved quantity in Newtonian dynamics. However, it is common practice to compute it at each time step in order to check that it is indeed constant with time. In other words, during the run energy flows back and forth between kinetic and potential, causing K(t) and U(t) to fluctuate while their sum remains fixed.

In practice, there could be small fluctuations in the total energy, typically of the order of, say, one part in 10⁴ or less. These fluctuations are usually caused by errors in the time integration (see section 2.3), and can be reduced in magnitude by reducing the time step if considered excessive.

Slow drifts of the total energy are also sometimes observed in very long runs. Such drifts can also be caused by an excessive ∆t. Drifts are more disturbing than fluctuations because the thermodynamic state of the system is also changing together with the energy, and therefore time averages over the run do not refer to a single state. If drifts over long runs tend to occur, they can be prevented, for instance by breaking the long run into smaller pieces and restoring the energy to the nominal value between one piece and the next. A common mechanism to adjust the energy is to modify the kinetic energy via rescaling of velocities.

A final word of caution: while one may be tempted to achieve "perfect" energy conservation by reducing the time step as much as desired, working with an excessively small time step may result in a waste of computer time. A practical compromise would probably allow for small energy fluctuations and perhaps slow energy drifts, as a price to pay to work with a reasonably large ∆t. See also [7].

3.4 Temperature

The temperature T is directly related to the kinetic energy by the well-known equipartition formula, assigning an average kinetic energy k_B T / 2 per degree of freedom:

K = \frac{3}{2} N k_B T    (20)

An estimate of the temperature is therefore directly obtained from the average kinetic energy K (see 3.2). For practical purposes, it is also common practice to define an "instantaneous temperature" T(t), proportional to the instantaneous kinetic energy K(t) by a relation analogous to the one above.
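A minimal sketch of eq. (20) turned around to give the instantaneous temperature from an (N, 3) velocity array, in reduced units (k_B = m = 1 by default); accumulating its sum at every step and dividing at the end implements strategy 1 of section 3.

```python
import numpy as np

def instantaneous_temperature(v, kB=1.0, mass=1.0):
    """Instantaneous temperature T(t) = 2 K(t) / (3 N kB), from eq. (20)."""
    N = v.shape[0]
    kinetic = 0.5 * mass * np.sum(v**2)   # K(t), eq. (19)
    return 2.0 * kinetic / (3.0 * N * kB)

# Strategy 1 of section 3: temp_sum += instantaneous_temperature(v) at every step,
# then <T> = temp_sum / n_steps at the end of the run.
```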

3.5 Radial distribution function

One of the most important static properties that can be used to characterize liquids in general, and LJ liquids in particular, is the radial distribution function (rdf), defined as:

g(r) = \frac{1}{\rho 4\pi r^2 dr} \sum_{ij} \langle \delta(r - |r_i - r_j|) \rangle    (21)

where δ(x) is the function: δ(0) = 1, and δ(x) = 0 for x ≠ 0. The rdf is important for three reasons:

1. for pairwise additive potentials such as the LJ one, knowledge of the rdf is sufficient information to calculate thermodynamic properties, particularly the energy and pressure;

2. there are very well developed integral-equation theories that permit estimation of the rdf for a given molecular model;

3. the rdf can be measured experimentally, using neutron-scattering techniques.
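As an illustration of how eq. (21) is evaluated in practice, the sketch below histograms the minimum-image pair distances of a single configuration and normalizes each bin by the ideal-gas count ρ 4π r² dr; in a real run the histogram would be accumulated over many configurations. Names and parameters are illustrative choices.

```python
import numpy as np

def radial_distribution(positions, box, n_bins=100, r_max=None):
    """Histogram estimate of g(r), eq. (21), for one configuration
    (orthorhombic box, minimum image convention)."""
    N = positions.shape[0]
    rho = N / np.prod(box)
    if r_max is None:
        r_max = 0.5 * box.min()
    dr = r_max / n_bins
    hist = np.zeros(n_bins)

    for i in range(N - 1):
        d = positions[i + 1:] - positions[i]
        d -= box * np.round(d / box)              # minimum image
        r = np.sqrt((d**2).sum(axis=1))
        idx = (r[r < r_max] / dr).astype(int)
        np.add.at(hist, idx, 2.0)                 # count both (i,j) and (j,i)

    r_mid = (np.arange(n_bins) + 0.5) * dr
    shell_vol = 4.0 * np.pi * r_mid**2 * dr       # volume of each spherical shell
    return r_mid, hist / (N * rho * shell_vol)
```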


3.6 Mean square displacement and velocity autocorrelation function

Two of the most important dynamic properties that are used to characterize liquids are the mean square displacement (MSD) and the velocity autocorrelation function (VACF).

The MSD of atoms in a simulation can be easily computed from its definition

\Delta r^2(t) = \langle |r(t) - r(0)|^2 \rangle    (22)

where ⟨...⟩ denotes here an average over all the atoms (or all the atoms in a given subclass). When using periodic boundary conditions, care must be taken so that the "jumps" made to refold particles into the box do not contribute to the diffusion.

The MSD contains information on the atomic diffusivity. If the system is solid, the MSD saturates to a finite value, while if the system is liquid, the MSD grows linearly with time. In this case it is useful to characterize the system behavior in terms of the slope, which gives the diffusion coefficient D:

D = \lim_{t\to\infty} \frac{1}{2dt} \langle |r(t) - r(0)|^2 \rangle    (23)

where d is the dimensionality of the system (usually 2 or 3).

The VACF is defined as

C(t) = \langle v(t) \cdot v(0) \rangle    (24)

and it is related to D by the equation:

D = \frac{1}{d} \int_0^{\infty} dt\, \langle v(t) \cdot v(0) \rangle    (25)
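A minimal sketch of eqs. (22) and (23): the MSD is computed from an unwrapped trajectory (shape n_steps × N × 3, i.e. not refolded into the box) and D is estimated from the slope of its late, linear part. The half-trajectory fitting window is an arbitrary illustrative choice.

```python
import numpy as np

def mean_square_displacement(traj):
    """MSD(t) from an unwrapped trajectory of shape (n_steps, N, 3), eq. (22)."""
    disp = traj - traj[0]                       # r(t) - r(0) for every atom
    return (disp**2).sum(axis=2).mean(axis=1)   # average over atoms

def diffusion_coefficient(msd, dt, d=3):
    """Estimate D from the long-time slope of the MSD, eq. (23)."""
    t = np.arange(len(msd)) * dt
    half = len(msd) // 2                        # fit only the late, linear part
    slope = np.polyfit(t[half:], msd[half:], 1)[0]
    return slope / (2.0 * d)
```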

3.7 Pressure

The measurement of the pressure in a molecular dynamics simulation is based on the Clausius virial function

W(r_1, \ldots, r_N) = \sum_{i=1}^{N} r_i \cdot F_i^{TOT}    (26)

where F_i^{TOT} is the total force acting on atom i. Its statistical average ⟨W⟩ will be obtained, as usual, as an average over the molecular dynamics trajectory:

\langle W \rangle = \lim_{t\to\infty} \frac{1}{t} \int_0^t d\tau \sum_{i=1}^{N} r_i(\tau) \cdot m_i a_i(\tau)

where use has been made of Newton’s law (1). Integrating by parts,

\langle W \rangle = -\lim_{t\to\infty} \frac{1}{t} \int_0^t d\tau \sum_{i=1}^{N} m_i |v_i(\tau)|^2 .

Apart from the sign, this is twice the average kinetic energy; therefore, by the equipartition law of statistical mechanics,

\langle W \rangle = -dNk_BT    (27)


where k_B is the Boltzmann constant.

Now, one may think of the total force acting on a particle as composed of two contributions:

F_i^{TOT} = F_i + F_i^{EXT}    (28)

where F_i is the internal force (arising from the interatomic interactions), and F_i^{EXT} is the external force exerted by the container's walls. If the particles are enclosed in a parallelepipedic container of sides L_x, L_y, L_z, volume V = L_x L_y L_z, and with the coordinate origin on one of its corners, the part ⟨W^{EXT}⟩ due to the container can be evaluated using the definition (26):

\langle W^{EXT} \rangle = L_x(-PL_yL_z) + L_y(-PL_xL_z) + L_z(-PL_xL_y) = -dPV

where -PL_yL_z is, for instance, the external force F_x^{EXT} applied by the yz wall along the x direction to particles located at x = L_x, etc. Eq. (27) can then be written

\left\langle \sum_{i=1}^{N} r_i \cdot F_i \right\rangle - dPV = -dNk_BT

or

PV = Nk_BT + \frac{1}{d} \left\langle \sum_{i=1}^{N} r_i \cdot F_i \right\rangle    (29)

This important result is known as the virial equation. All the quantities except the pressure P are easily accessible in a simulation, and therefore (29) constitutes a way to measure P. Note how (29) reduces to the well-known equation of state of the perfect gas if the particles are non-interacting.

In the case of pairwise interactions via a potential φ(r), it is left as an exercise to the reader to verify that (29) becomes

PV = Nk_BT - \frac{1}{d} \left\langle \sum_i \sum_{j>i} r_{ij} \left.\frac{d\phi}{dr}\right|_{r_{ij}} \right\rangle .    (30)

This expression has the additional advantage over (29) of being naturally suited for use when periodic boundary conditions are present: it is sufficient to take them into account in the definition of r_ij.
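A minimal sketch of eq. (30) for a Lennard-Jones system in reduced units, using the minimum image convention. For brevity the potential cutoff is not applied, and the instantaneous value returned here would of course be averaged over the run; the function names are illustrative choices.

```python
import numpy as np

def lj_dphi_dr(r, epsilon=1.0, sigma=1.0):
    """Derivative d(phi_LJ)/dr of the Lennard-Jones potential."""
    sr6 = (sigma / r)**6
    return 24.0 * epsilon * (sr6 - 2.0 * sr6**2) / r

def virial_pressure(positions, velocities, box, mass=1.0, kB=1.0, d=3):
    """Instantaneous pressure from eq. (30):
    P V = N kB T - (1/d) sum_{i<j} r_ij dphi/dr|_{r_ij}."""
    N = positions.shape[0]
    volume = np.prod(box)
    T = 2.0 * (0.5 * mass * np.sum(velocities**2)) / (d * N * kB)   # eq. (20)
    virial = 0.0
    for i in range(N - 1):
        dr = positions[i + 1:] - positions[i]
        dr -= box * np.round(dr / box)          # minimum image
        r = np.sqrt((dr**2).sum(axis=1))
        virial += np.sum(r * lj_dphi_dr(r))
    return (N * kB * T - virial / d) / volume
```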

As for the potential energy, the expression for the pressure also has to be corrected if a truncated potential is used in the numerical simulation. By using the same assumption mentioned in section 3.1, one can derive the following correction, to be applied to each term of the outermost sum in the pressure formulas above, corresponding to the contribution of each pair ij:

P_{LRC} = \frac{1}{2}\, \frac{4\pi\rho^2}{3} \int_{R_c}^{\infty} dr_{ij}\, r_{ij}^2\, r_{ij} \cdot F_{ij} = 8\pi\rho^2\varepsilon\sigma^3 \left[ \frac{4}{9}\left(\frac{\sigma}{R_c}\right)^9 - \frac{2}{3}\left(\frac{\sigma}{R_c}\right)^3 \right]    (31)


4 Computational aspects of MD

4.1 MD algorithms

Classical MD is commonly used for simulating the properties of liquids, solids and molecules. Each of the N atoms or molecules in the simulation is treated as a point mass and Newton's equations are integrated to compute their motion. From the motion of the ensemble of atoms a variety of useful microscopic and macroscopic information can be extracted, such as transport coefficients, phase diagrams and structural or conformational properties. The physics of the model is contained in a potential energy functional for the system, from which individual force equations for each atom are derived.

MD simulations are typically not memory intensive since only vectors of atom information are stored. Computationally, the simulations are "large" in two domains:

1. the number of atoms

2. the number of time steps

The length scale for atomic coordinates is angstroms; in three dimensions many thousands or millions of atoms must usually be simulated to approach even the submicron scale. In liquids and solids the time step size is constrained by the demand that the vibrational motion of the atoms be accurately tracked. This limits time steps to the femtosecond scale, and so tens or hundreds of thousands of time steps are necessary to simulate even picoseconds of "real" time. Because of these computational demands, considerable effort has been expended by researchers to optimize MD and even to build special-purpose hardware for performing MD simulations. The current state of the art is such that simulating ten- to hundred-thousand-atom systems for nanoseconds takes hours of CPU time on machines such as the Jugene or Juropa supercomputers at JSC.

Since MD computations are inherently parallel, as we will see in the next pages, there has been considerable effort in recent years by researchers to exploit this parallelism on various kinds of machines. Experience has shown that the message-passing model of programming6 for multiple-instruction/multiple-data (MIMD) parallel machines is the only one that provides enough flexibility to implement all the data structure and computational enhancements that are commonly exploited in MD codes on serial (and vector) machines.

Increasingly efficient algorithms have been developed to perform specific kinds of classical MD. In this course we will focus on some of these algorithms, which are appropriate for a general class of MD problems with two salient characteristics:

1. The first characteristic is that forces are limited in range, meaning each atom interacts only with other atoms that are geometrically nearby. Solids and liquids are often modeled this way due to electronic screening effects or simply to avoid the computational cost of including long-range Coulombic forces. For short-range MD, the computational effort per time step scales as N, the number of atoms.

6 MPI is a library specification for the message-passing model, which has become the de facto standard for message-passing programming.


2. The second characteristic is that the atoms can undergo large displacements over the duration of the simulation. This could be due to diffusion in a solid or liquid, or conformational changes in a biological molecule. The important feature from a computational standpoint is that each atom's neighbors change as the simulation progresses. While the algorithms we discuss could also be used for fixed-neighbor simulations (e.g. all atoms remain on lattice sites in a solid), it is a harder task to continually track the neighbors of each atom and maintain efficient O(N) scaling for the overall computation on a parallel machine.

Moreover, one wants the algorithms to work well on problems with small numbers of atoms, not just for large problems where parallelism is often easier to exploit. This is because the vast majority of MD simulations are performed on systems of a few hundred to several thousand atoms, where N is chosen to be as small as possible while still accurate enough to model the desired physical effects. The computational goal in these calculations is to perform each time step as quickly as possible. This is particularly true in non-equilibrium MD, where macroscopic changes in the system may take significant time to evolve, requiring millions of time steps to model. Thus, it is often more useful to be able to perform a 1,000,000-time-step simulation of a 1000-atom system quickly rather than 1000 time steps of a 1,000,000-atom system, though the O(N) scaling means the computational effort is the same for both cases. To this end, we consider model sizes as small as a few hundred atoms in this course.

On the other hand, for very large MD problems, it is important that the parallel algorithms are scalable to larger and faster parallel machines, i.e. that they scale optimally with respect to both N and P (the number of processors).

4.2 Long- vs short-range force model

The computational task in a classical MD simulation is to integrate the set of coupled differential equations (Newton's equations), which in general can be written

\[
m_i \frac{d\mathbf{v}_i}{dt} = \sum_{j} \mathbf{F}_2(\mathbf{r}_i, \mathbf{r}_j) + \sum_{j,k} \mathbf{F}_3(\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k) + \dots \tag{32}
\]
\[
\frac{d\mathbf{r}_i}{dt} = \mathbf{v}_i
\]

where mi is the mass of atom i, ri and vi are its position and velocity vectors, F2 is a force function describing pairwise interactions between atoms, F3 describes three-body interactions, and many-body interactions can be added. The force terms are derivatives of energy expressions (the potentials) in which the potential energy of atom i is typically written as a function of the positions of itself and the other atoms. In practice, only one or a few terms in equation (32) are kept, and F2, F3, etc. are constructed so as to include many-body and quantum effects. To the extent that the approximations are accurate, these equations give a full description of the time evolution of the system7.

7 Thus, the great computational advantage of classical MD, as compared to so-called ab initio electronic structure calculations, is that the dynamic behavior of the atomic system is described empirically, without having to solve the computationally expensive but "exact" Schrödinger equation at each time step.


The force terms in equation (32) are typically non-linear functions of the distance rij between pairs of atoms and may be either long-range or short-range in nature. For long-range forces, such as Coulombic interactions in an ionic solid or biological system, each atom interacts with all others. Directly computing these forces scales as N² and is too costly for large N. Various approximate methods, valid in different conditions, overcome this difficulty. They include particle-mesh algorithms [10], which scale as f(M)N where M is the number of mesh points, hierarchical methods [11], which scale as N log(N), and fast-multipole methods [12], which scale as N. Modern parallel implementations of these algorithms have improved their range of applicability for many-body simulations, but because of their expense, long-range force models are not dealt with in this course.

By contrast, short-range force models, and in particular the Lennard-Jones one introduced in paragraph 2.1.1, are our main concern. They are chosen either because electronic screening effectively limits the range of influence of the interatomic forces being modeled, or simply to truncate the long-range interactions and lessen the computational load. In either case, the summations in equation (32) are restricted to atoms within some small region surrounding atom i. As we have seen in paragraph 2.1.2, this is typically implemented using a cutoff distance Rc, outside of which all interactions are ignored. The work to compute forces now scales linearly with N. Notwithstanding these savings, the vast majority of computation time in a short-range force MD simulation is spent evaluating the force terms in equation (32), even when (as in the LJ model) only the F2 term has to be evaluated. The time integration typically requires only 2-3% of the total time. To evaluate the sums efficiently one needs to know which atoms are within the cutoff distance Rc at every time step. The key is to minimize the number of neighboring atoms that must be checked for possible interactions, since calculations performed on neighbors at a distance r > Rc are wasted computation. There are two basic techniques used to accomplish this on serial (and vector) machines. We discuss them since our parallel algorithms incorporate similar ideas.

4.3 Neighbor lists and link-cell method

The first idea to minimize the number of neighboring atoms that must be checked for possible interactions is that of neighbor lists, originally proposed by Verlet [2]. For each atom, a list is maintained of nearby atoms. Typically, when the list is formed, all neighboring atoms within an extended cutoff distance Rs = Rc + δ are stored. The list is used for a few time steps to calculate all force interactions. Then it is rebuilt before any atom could have moved from a distance r > Rs to r < Rc. Though δ is always chosen to be small relative to Rc, an optimal value depends on the parameters (e.g. temperature, diffusivity, density) of the particular simulation. The advantage of the neighbor list is that once it is built, examining it for possible interactions is much faster than checking all atoms in the system.

The second technique commonly used for speeding up MD calculations is known as the link-cell method [13]. At every time step, all the atoms are binned into 3-D cells of side length ℓ, where ℓ = Rc or slightly larger. This reduces the task of finding the neighbors of a given atom to checking 27 bins: the bin the atom is in and the 26 surrounding ones. Since binning the atoms only requires O(N) work, the extra overhead associated with it is acceptable for the savings of only having to check a local region for neighbors.

The fastest serial short-range MD algorithms use a combination of neighbor lists and link-cell binning. In the combined method, atoms are only binned once every few time steps, for the purpose of forming neighbor lists. In this case atoms are binned into cells of size ℓ ≥ Rs. At intermediate time steps the neighbor lists alone are used in the usual way to find neighbors within a distance Rc of each atom. This is a significant savings over a conventional link-cell method, since there are far fewer atoms to check in a sphere of volume 4πRs³/3 than in a cube of volume 27Rc³. Additional savings can be gained from Newton's 3rd law:

Fij = −Fji (33)

by only computing a force once for each pair of atoms (rather than once for each atom in the pair). In the combined method this is done by only searching half the surrounding bins of each atom to form its neighbor list. This has the effect of storing atom j in atom i's list, but not atom i in atom j's list, thus halving the number of force computations that must be done.
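
As an illustration of how the two ideas combine, the sketch below bins the atoms into cells of side at least Rs and builds a Verlet list from the 27 surrounding cells. It keeps each pair once through a simple j > i test instead of the half-stencil search described above; the names, the cubic periodic box, the assumption of at least 3 cells per dimension, and the caller-allocated list arrays are all conventions of this sketch.

    /* Sketch of the combined link-cell + Verlet-list build (reduced units).
       Coordinates are assumed wrapped into [0, box); nbr[i] must be large
       enough to hold all neighbors of atom i within rs. */
    #include <math.h>
    #include <stdlib.h>

    void build_neighbor_list(int natoms, const double (*x)[3], double box,
                             double rs, int *nnbr, int **nbr)
    {
        int nc = (int)(box / rs);          /* cells per dimension; assume nc >= 3 */
        double cell = box / nc;
        int *head = malloc((size_t)nc * nc * nc * sizeof(int));
        int *next = malloc((size_t)natoms * sizeof(int));
        for (int c = 0; c < nc * nc * nc; c++) head[c] = -1;

        /* bin all atoms into cells (link-cell structure) */
        for (int i = 0; i < natoms; i++) {
            int cx = (int)(x[i][0] / cell) % nc;
            int cy = (int)(x[i][1] / cell) % nc;
            int cz = (int)(x[i][2] / cell) % nc;
            int c = (cx * nc + cy) * nc + cz;
            next[i] = head[c];
            head[c] = i;
        }

        /* for each atom, scan its own and the 26 surrounding (periodic) cells */
        for (int i = 0; i < natoms; i++) {
            nnbr[i] = 0;
            int cx = (int)(x[i][0] / cell), cy = (int)(x[i][1] / cell), cz = (int)(x[i][2] / cell);
            for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++)
            for (int dz = -1; dz <= 1; dz++) {
                int c = (((cx + dx + nc) % nc) * nc + (cy + dy + nc) % nc) * nc
                        + (cz + dz + nc) % nc;
                for (int j = head[c]; j >= 0; j = next[j]) {
                    if (j <= i) continue;              /* store each pair once */
                    double r2 = 0.0;
                    for (int k = 0; k < 3; k++) {
                        double d = x[i][k] - x[j][k];
                        d -= box * round(d / box);     /* minimum image */
                        r2 += d * d;
                    }
                    if (r2 < rs * rs) nbr[i][nnbr[i]++] = j;
                }
            }
        }
        free(head); free(next);
    }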

4.4 Parallel algorithms

In the last 20 years there has been considerable interest in devising parallel MD algorithms. The natural parallelism in MD is that the force calculations and velocity/position updates can be done simultaneously for all atoms. To date, two basic ideas have been exploited to achieve this parallelism. The goal in each is to divide the force computations in equation (32) evenly across the processors so as to extract maximum parallelism. All the algorithms that have been proposed or implemented so far are variations on these two methods.

In the first class of methods a pre-determined set of force computations is assigned to each processor. The assignment remains fixed for the duration of the simulation. The simplest way of doing this is to give a subgroup of atoms to each processor. We call this method an atom-decomposition of the workload, since the processor computes forces on its atoms no matter where they move in the simulation domain. More generally, a subset of the force loops inherent in equation (32) can be assigned to each processor. We term this a force-decomposition. Both of these decompositions are analogous to "Lagrangian gridding" in a fluids simulation, where the grid cells (computational elements) move with the fluid (atoms in MD). By contrast, in the second general class of methods, which we call a spatial-decomposition of the workload, each processor is assigned a portion of the physical simulation domain. Each processor computes only the forces on atoms in its sub-domain. As the simulation progresses, processors exchange atoms as they move from one sub-domain to another. This is analogous to an "Eulerian gridding" for a fluids simulation, where the grid remains fixed in space as fluid moves through it.

Within the two classes of methods for parallelization of MD, a variety of algorithms have been proposed and implemented by various researchers. The details of the algorithms vary widely from one parallel machine to another, since there are numerous problem-dependent and machine-dependent trade-offs to consider, such as the relative speeds of computation and communication.

Atom-decomposition methods, also called replicated-data methods because identical copies of atom information are stored on all processors, have often been used in short-range MD simulations of molecular systems. This is because the duplication of information makes for straightforward computation of additional three- and four-body force terms. Parallel implementations of older biological MD programs such as CHARMM [14] and GROMOS [15] use this technique.

Spatial-decomposition methods, also called geometric methods, are now more common in the literature because they are well suited to very large MD simulations and to cases where long-range interactions are involved. Very recent parallel implementations of state-of-the-art biological MD programs such as AMBER [16], GROMACS [17] and NAMD [18] use this technique.


5 Computational aspects of our model

5.1 Reduced units

Unit systems are invented to make physical laws look simple and numerical calculations easy. Take Newton's law: F = ma. In the SI unit system, this means that if an object of mass x (kg) is undergoing an acceleration of y (m/s²), the force on the object must be xy (N).

However, there is nothing intrinsically special about the SI unit system. One (kg) is simply the mass of a platinum-iridium prototype kept in a vacuum chamber in Paris. If one wishes, one can define one's own mass unit, say (k̃g), which is 1/7 of the mass of the Paris prototype: 1 (kg) = 7 (k̃g).

If (k̃g) is one's choice of the mass unit, what about the rest of the unit system? One really has to make a decision here: either keep all the other units unchanged and only make the (kg)→(k̃g) transition, or change some other units along with the (kg)→(k̃g) transition.

Imagine making the first choice, i.e. keeping all the other units of the SI system unchanged, including the force unit (N), and only changing the mass unit from (kg) to (k̃g). That is all right, except that in the new unit system Newton's law must be re-expressed as F = ma/7, because if an object of mass 7x (k̃g) is undergoing an acceleration of y (m/s²), the force on the object is xy (N).

There is nothing inherently wrong with the F = ma/7 expression, which is just a recipe for computation, and a correct one for the newly chosen unit system. Fundamentally, F = ma/7 and F = ma describe the same physical law. But it is true that F = ma/7 is less elegant than F = ma. No one likes to memorize extra constants if they can be reduced to unity by a sensible choice of units. The SI unit system is sensible, because (N) is picked to work with the other SI units to satisfy F = ma.

How may we have a sensible unit system with (k̃g) as the mass unit? Simple: just define (Ñ) = (N)/7 as the new force unit. The (m)-(s)-(k̃g)-(Ñ) unit system is sensible because the simplest form of F = ma is preserved. Thus we see that when a certain unit in a sensible unit system is altered, other units must also be altered correspondingly in order to constitute a new sensible unit system, which keeps the algebraic forms of all fundamental physical laws unaltered8.

In science people have formed deep-rooted conventions about the simplest algebraic forms of physical laws, such as F = ma, K = mv²/2, E = K + U, etc. Although nothing forbids one from modifying the constant coefficients in front of each expression, one is better off not doing so. Fortunately, as long as one uses a sensible unit system, these algebraic expressions stay invariant.

Now, imagine we derive a certain composite law from a set of simple laws. On one side, we start with and consistently use a sensible unit system A. On the other side, we start with and consistently use another sensible unit system B. Since the two sides use exactly the same algebraic forms, the resultant algebraic expression must also be the same, even though for a given physical instance a variable takes on two different numerical values on the two sides, as

8 A notable exception is the conversion between the SI and Gaussian unit systems in electrodynamics, during which a non-trivial factor of 4π comes up.


different unit systems are adopted. This means that the final algebraic expression describing the physical phenomenon must satisfy a certain concerted scaling invariance with respect to its dependent variables, corresponding to any feasible transformation between sensible unit systems. This strongly limits the form of possible algebraic expressions describing physical phenomena, which is the basis of dimensional analysis.

In the MD world there is also another reason why people prefer to choose "non-conventional" units. Typical system sizes in molecular dynamics simulations are very small, up to roughly 1000 atoms. As a consequence, most of the extensive quantities are small in magnitude when measured in macroscopic units. There are two possibilities to overcome this problem: either work with atomic-scale units (ps, a.m.u., nm), or make all the observable quantities dimensionless with respect to their characteristic values. The second approach is more popular. The scaling is done with the model parameters, e.g. in the case of the LJ model: size σ, energy ε, mass m. The common recipe is to take the model parameters themselves as reference values (ε for energies, σ for lengths, m for masses) and to express every quantity (say the total energy E) in terms of the corresponding reference value (E* = E/ε). The other quantities are scaled similarly. For example:

    distance       r* = r/σ,                (34)
    energy         E* = E/ε,                (35)
    temperature    T* = kB T / ε,           (36)
    pressure       P* = P σ³ / ε,           (37)
    time           t* = t / (σ √(m/ε)),     (38)
    force          F* = F σ / ε,            (39)
    density        ρ* = ρ σ³ / m            (40)

and so on. These are the so-called reduced units for the LJ potential model that we will adopt in our codes. If we write the LJ potential (6) in dimensionless form

\[
\phi^*_{LJ}(r^*) = 4\left[ \left(\frac{1}{r^*}\right)^{12} - \left(\frac{1}{r^*}\right)^{6} \right] \tag{41}
\]

we see that it is parameter independent; consequently, all the properties must also be parameter independent. If a potential has only a couple of parameters, this scaling has many advantages: the potential evaluation can be very efficient in reduced units, and since the reduced results are always the same, they can be transferred to different systems by straightforward scaling with the model parameters σ, ε and m. This is equivalent to selecting unit values for the parameters, and it is convenient to report system properties in this form, e.g. P*(ρ*).
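
In code, working in reduced units simply means that σ and ε never appear explicitly. A minimal sketch of the dimensionless potential (41) and of the corresponding pair-force factor (the function names are our own):

    /* Reduced-unit LJ pair potential (41) and pair-force factor, to be
       applied only for r*^2 below the squared cutoff.  lj_fpair_star
       returns F*/r*, the factor multiplying the components of r*_ij. */
    static inline double lj_phi_star(double r2)      /* r2 = r*^2 */
    {
        double ir6 = 1.0 / (r2 * r2 * r2);
        return 4.0 * (ir6 * ir6 - ir6);
    }

    static inline double lj_fpair_star(double r2)
    {
        double ir2 = 1.0 / r2, ir6 = ir2 * ir2 * ir2;
        return (48.0 * ir6 * ir6 - 24.0 * ir6) * ir2;
    }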

For reference, Tab. 1 shows the values of the LJ model parameters for some noble gases, which are well described by such a physical model.

          σ (Å)     ε/kB (K)    m (a.m.u.)
    Ne    2.78      37.29       20.18
    Ar    3.405     119.8       39.95
    Kr    3.680     166.6       83.80
    Xe    3.924     257.4       131.29

Table 1: LJ model parameters for some noble gases.

5.2 Simulation parameters

As mentioned several times in the previous chapters, we will study a model of N identical point-like atoms in a box, where periodic boundary conditions are applied in all directions and atoms interact with each other only through the Lennard-Jones pairwise potential (41) in dimensionless reduced units. The (negative) derivative of (41) with respect to the relative distance r between two particles gives the F2 term in equation (32) (F3 and higher-order terms are ignored).

The N atoms are simulated in a 3D parallelepiped with periodic boundary conditions at the Lennard-Jones state point defined by the reduced density:

ρ∗ = 0.8442

and reduced temperature:

T* = 0.72

This is a liquid state near the Lennard-Jones triple point9.

The simulation will begin with the atoms on a face-centered cubic (fcc) lattice (Fig. 2) with randomized velocities. The fcc configuration represents a minimal-energy configuration (ground state) and consequently would be stable at T = 0, i.e. without kinetic energy, as a solid crystal. Since we will provide kinetic energy (T ≠ 0) to the system, it quickly melts, evolving to its natural liquid state. The equilibration phase, in this very simple model, happens very fast. A roughly uniform spatial density persists for the duration of the simulation.

The simulation is run at constant N, volume V, and total energy E, i.e. it is a statistical sampling from the microcanonical ensemble.

Force computations using the potential in equation (41) are truncated at a cutoff distance R*c = 2.5. Using such a cutoff, in our simulation each atom has on average 55 neighbors. If neighbor lists are used, an extended cutoff length R*s = 2.8 is defined (encompassing about 78 atoms) for forming the neighbor lists, and the lists are updated every 20 time steps.

9 A triple point is a point on the phase diagram at which three distinct thermodynamic phases coexist in equilibrium. For the LJ model without truncation, Mastny and de Pablo found it at T* = 0.694 and ρ* = 0.96 [19].


Figure 2: Face-centered cubic (fcc) lattice cell, used as the initial configuration of our system.

The cost of creating the neighbor lists every 20 time steps is amortized over the per-time-step timing.

The integration time step we are going to use in the leap-frog algorithm is

∆t∗ = 0.00462.

In the case of Ar, the constant for converting times between reduced and conventional units is, according to (38) and the parameter values from Tab. 1,

\[
\sqrt{\frac{m\sigma^2}{\varepsilon}} = 2.156 \times 10^{-12}\ \mathrm{s},
\]

thus the reduced time unit is 2.156 ps. This is roughly the timescale of one atomic oscillation period in Ar. The integration time step for Ar corresponds to

∆t = ∆t* × 2.156 ps ≈ 0.01 ps.

In other words, we choose such an integration time step (∼1/100 of an atomic oscillation period) so as to accurately describe even this elementary atomic motion.
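
The conversion is easy to check numerically; the small program below recomputes the Ar time unit from the Tab. 1 parameters. The numerical values of kB and of the atomic mass unit are the only inputs not taken from the text.

    /* Check of the Ar reduced time unit: tau = sigma * sqrt(m/eps). */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double kB    = 1.380649e-23;           /* J/K */
        const double amu   = 1.66054e-27;            /* kg */
        const double sigma = 3.405e-10;              /* m   (Tab. 1) */
        const double eps   = 119.8 * kB;             /* J   (Tab. 1: eps/kB = 119.8 K) */
        const double m     = 39.95 * amu;            /* kg  (Tab. 1: 39.95 a.m.u.) */

        double tau = sigma * sqrt(m / eps);          /* ~2.156e-12 s = 2.156 ps */
        double dt  = 0.00462 * tau;                  /* ~1.0e-14 s  = 0.01 ps   */
        printf("tau = %.3e s,  dt = %.3e s\n", tau, dt);
        return 0;
    }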


6 Atom-decomposition algorithm

The first parallel algorithm that we will consider is the atom-decomposition (AD) algorithm. In this algorithm each of the P processors is assigned a group of N/P atoms at the beginning of the simulation. Atoms in a group need not have any special spatial relationship to each other. For ease of exposition, we assume N is a multiple of P, though it is simple to relax this constraint. A processor will compute forces on only its N/P atoms and will update their positions and velocities for the duration of the simulation, no matter where they move in the physical domain. This is an atom-decomposition of the computational workload.

A useful construct for representing the computational work involved in the algorithm is the N × N force matrix F. The (ij) element of F represents the force on atom i due to atom j. Note that F is sparse due to the short-range forces, and skew-symmetric, i.e. Fij = −Fji, due to Newton's 3rd law. We also define x and f as vectors of length N which store the position and total force on each atom. For a 3-D simulation, xi would store the three coordinates of atom i. With these definitions, the AD algorithm assigns each processor a sub-block of F which consists of N/P rows of the matrix, as shown in Figure 3. If z indexes the processors from 0 to P − 1, then processor Pz computes matrix elements in the Fz sub-block of rows. It is also assigned the corresponding sub-vectors of length N/P, denoted by xz and fz.

Figure 3: The division of the force matrix among P processors in the atom-decomposition algorithm. Processor Pz is assigned N/P rows of the matrix and the corresponding xz piece of the position vector. In addition, it must know the entire position vector x to compute the matrix elements in its rows.

For the LJ potential, the computation of matrix element Fij requires only the two atom positions xi and xj. To compute all the elements in Fz, processor Pz will need the positions of many atoms owned by other processors. This implies that at every time step each processor must receive updated atom positions from all the other processors, an operation called all-to-all communication. Various algorithms have been developed for performing this operation efficiently on different parallel machines and architectures. Usually, the MPI library implementation of the all-to-all communication operations takes advantage of such optimized algorithms. In the course we will implement such operations by using the MPI library. Moreover, we will also try to implement them using only point-to-point communication operations, to compare with the MPI implementation. In particular, we will resort to an idea outlined in [20] that is simple, portable, and works well on a variety of machines.


6.1 Expand and fold operations

Following Fox's nomenclature [20], we term the all-to-all communication procedure described above an expand operation. Each processor allocates memory of length N to store the entire x vector. At the beginning of the expand, processor Pz has xz, an updated piece of x of length N/P. Each processor needs to acquire all the other processors' pieces, storing them in the correct places in its copy of x. Figure 4a illustrates the steps that accomplish this for an 8-processor example. The processors are mapped consecutively to the sub-pieces of the vector. In the first communication step, each processor partners with an adjacent processor in the vector and they exchange sub-pieces; processor 2 partners with 3. Now every processor has a contiguous piece of x that is of length 2N/P. In the second step, each processor partners with a processor two positions away and exchanges its new piece (2 receives the shaded sub-vectors from 0). Each processor now has a 4N/P-length piece of x. In the last step, each processor exchanges an N/2-length piece of x with a processor P/2 positions away (2 exchanges with 6); the entire vector now resides on each processor.

The expand operation can also be accomplished by using the mpi_allgather collective operation provided by the MPI library.

Figure 4: Expand and fold operations among eight processors, each of which requires three steps. (a) In the expand, processor 2 receives successively longer shaded sub-vectors from processors 3, 0, and 6; (b) in the fold, processor 2 receives successively shorter shaded sub-vectors from processors 6, 0, and 3.

A communication operation that is essentially the inverse of the expand will also prove useful in the atom-decomposition algorithm. Assume each processor has stored new force values throughout its copy of the force vector f. Processor Pz needs to know the N/P values in fz,


where each of the values is summed across all P processors. This is known as a fold operation and is outlined in Figure 4b. In the first step each processor exchanges half the vector with a partner processor that is P/2 positions away. Note that each processor receives the half that its own sub-block belongs to and sends the half it does not belong to (processor 2 receives the shaded first half of the vector from 6). Each processor sums the received values with its corresponding retained sub-vector. This operation is recursed, halving the length of the exchanged data at each step.

The fold operation can also be accomplished with MPI collectives, e.g. an mpi_reduce_scatter (or an mpi_reduce followed by a scatter of the summed blocks).

Costs for a communication algorithm are typically quantified by the number of messages and the total volume of data sent and received. On both these counts the expand and fold of Figure 4 are optimal: each processor performs log2(P) sends and receives and exchanges N − N/P data values. Each processor also performs N − N/P additions in the fold. A drawback is that the algorithms require O(N) storage on every processor. Alternative methods for performing all-to-all communication require less storage at the cost of more sends and receives. This is usually not a good trade-off for MD simulations, because quite large problems can be run with the few GBytes of local memory available on current-generation processors.
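
As a sketch of how the two operations might be implemented, the expand can be a single mpi_allgather call, while the fold can be written with the point-to-point recursive-halving exchanges of Figure 4b. The version below assumes P is a power of two, N is a multiple of P, a contiguous rank-to-block mapping, and one value per atom; it is a sketch, not the course implementation.

    /* Expand: every processor contributes its n = N/P piece xz and ends up
       with the full N-length vector x (Figure 4a). */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    void expand(const double *xz, double *x, int n, MPI_Comm comm)
    {
        MPI_Allgather(xz, n, MPI_DOUBLE, x, n, MPI_DOUBLE, comm);
    }

    /* Fold: every processor starts with a full N-length vector f and ends up
       with the element-wise sum over all processors of its own block in fz
       (Figure 4b).  f is used as workspace and is overwritten. */
    void fold(double *f, double *fz, int N, MPI_Comm comm)
    {
        int rank, P;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &P);

        int lo = 0, len = N;                      /* part of f we still hold */
        double *recv = malloc((size_t)(N / 2) * sizeof(double));

        for (int step = P / 2; step >= 1; step /= 2) {
            int half = len / 2;
            int partner = rank ^ step;            /* processor "step" positions away */
            int keep_hi = (rank / step) % 2;      /* keep the half holding our own block */
            int keep_off = keep_hi ? lo + half : lo;
            int send_off = keep_hi ? lo : lo + half;

            MPI_Sendrecv(f + send_off, half, MPI_DOUBLE, partner, 0,
                         recv,         half, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            for (int i = 0; i < half; i++)
                f[keep_off + i] += recv[i];       /* sum into the retained half */

            lo = keep_off;
            len = half;
        }
        memcpy(fz, f + lo, (size_t)len * sizeof(double));   /* len == N/P here */
        free(recv);
    }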

We now describe two versions of an AD algorithm which use expand and fold operations.

6.2 The simpler algorithm

The first algorithm is simpler and does not take advantage of Newton's 3rd law. We call this algorithm A1; it is outlined in Figure 5, with the dominating term(s) in the computation or communication cost of each step listed on the right. We assume at the beginning of the time step that each processor knows the current positions of all N atoms, i.e. each has an updated copy of the entire x vector. Step (1) of the algorithm is to construct neighbor lists for all the pairwise interactions that must be computed in block Fz. Typically this will only be done once every few time steps. If the ratio of the physical domain diameter S to the extended force cutoff length Rs is relatively small, it is quicker for Pz to construct the lists by checking all N²/P pairs in its Fz block. When the simulation is large enough that 4 or more bins can be created in each dimension, it is quicker for each processor to bin all N atoms, then check the 27 surrounding bins of each of its N/P atoms to form the lists. This checking scales as N/P but has a large coefficient, so the overall scaling of the binned neighbor list construction is recorded as N/P + N.

In step (2) of the algorithm, the neighbor lists are used to compute the non-zero matrix elements in Fz. As each pairwise force interaction is computed, the force components are summed into fz, so that Fz is never actually stored as a matrix. At the completion of the step, each processor knows the total force fz on each of its N/P atoms. This is used to update their positions and velocities in step (4). (A step (3) will be added in the other algorithm.) Finally, in step (5) the updated atom positions in xz are shared among all P processors in preparation for the next time step, via the expand operation of Figure 4a. As discussed above, this operation scales as N, the volume of data in the position vector x.


(1) Construct neighbor lists of non-zero interactions in Fz
      (S < 4Rs) All pairs .................................... N²/P
      (S ≥ 4Rs) Binning ...................................... N/P + N
(2) Compute elements of Fz, summing results into fz ........... N/P
(4) Update atom positions in xz using fz ...................... N/P
(5) Expand xz among all processors, result in x ............... N

Figure 5: Single time step of atom-decomposition algorithm A1 for processor Pz.
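
Putting the pieces together, one time step of A1 might be organized as below. The serial kernels (neighbor-list build, force evaluation, leap-frog update) are assumed to exist and are only declared here; the sketch shows nothing more than the parallel structure of Figure 5.

    /* Sketch of one A1 time step on processor Pz: n = N/P local atoms,
       stored at offset 3*z*n of the replicated position vector x. */
    #include <mpi.h>

    /* Hypothetical serial kernels, assumed to be provided elsewhere. */
    void build_neighbor_lists_local(int n, int off, int N, const double *x);
    void compute_forces_local(int n, int off, int N, const double *x, double *fz);
    void integrate_local(int n, double *xz, double *vz, const double *fz);

    void a1_timestep(int step, int z, int n, int N,
                     double *x,   /* full position vector, length 3N (expanded) */
                     double *fz,  /* forces on my n atoms, length 3n            */
                     double *vz,  /* velocities of my n atoms, length 3n        */
                     MPI_Comm comm)
    {
        int off = 3 * z * n;

        if (step % 20 == 0)                        /* (1) rebuild neighbor lists     */
            build_neighbor_lists_local(n, off, N, x);

        compute_forces_local(n, off, N, x, fz);    /* (2) rows of Fz, summed into fz */
        integrate_local(n, x + off, vz, fz);       /* (4) leap-frog update of xz, vz */

        /* (5) expand: share my updated xz so everyone again holds the full x */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      x, 3 * n, MPI_DOUBLE, comm);
    }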

6.3 Taking advantage of Newton's 3rd law

As mentioned in the previous paragraph, algorithm A1 ignores Newton's 3rd law. If different processors own atoms i and j, as is usually the case, both processors compute the (ij) interaction and store the resulting force on their atom. This can be avoided, at the cost of more communication, by using a modified force matrix G which references each pairwise interaction only once. There are several ways to do this by striping the force matrix [21]; we choose instead to form G as follows. Let Gij = Fij, except that Gij = 0 when i > j and i + j is even, and likewise Gij = 0 when i < j and i + j is odd. Conceptually, G is colored like a checkerboard, with red squares above the diagonal set to zero and black squares below the diagonal also set to zero. A modified AD algorithm A2 that uses G to take advantage of Newton's 3rd law is outlined in Figure 6.

(1) Construct neighbor lists of non-zero interactions in Gz
      (S < 4Rs) All pairs .................................... N²/2P
      (S ≥ 4Rs) Binning ...................................... N/2P + N
(2) Compute elements of Gz, doubly summing results into local copy of f ... N/2P
(3) Fold f among all processors, result is fz ................. N
(4) Update atom positions in xz using fz ...................... N/P
(5) Expand xz among all processors, result in x ............... N

Figure 6: Single time step of atom-decomposition algorithm A2 for processor Pz, which takes advantage of Newton's 3rd law.
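
The checkerboard rule defining G is easy to express as a predicate used while the neighbor lists are built: a pair (i, j) is kept by the processor computing row i only when the following (hypothetical) test returns true.

    /* Checkerboard test for the modified force matrix G: G_ij = F_ij unless
       the (i,j) square is one of those zeroed out, so that each pair is
       referenced exactly once over the whole matrix. */
    static int in_G(int i, int j)
    {
        if (i < j) return (i + j) % 2 == 0;   /* above the diagonal: keep even i+j */
        if (i > j) return (i + j) % 2 == 1;   /* below the diagonal: keep odd  i+j */
        return 0;                             /* no self-interaction */
    }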

Step (1) is the same as in algorithm A1 except that only half as many neighbor list entries are made by each processor, since Gz has only half the non-zero entries of Fz. This is reflected in the factors of two included in the scaling entries. For neighbor lists formed by binning, each processor must still bin all N atoms, but only needs to check half the surrounding bins of each of its N/P atoms. In step (2) the neighbor lists are used to compute the elements of Gz. For an interaction between atoms i and j, the resulting forces on atoms i and j are summed into both the i and j locations of the force vector f. This means each processor must store a copy of the entire force vector, as opposed to just storing fz as in algorithm A1. When all the matrix elements have been computed, f is folded across all P processors using the algorithm of Figure 4b. Each processor ends up with fz, the total forces on its atoms. Steps (4) and (5) then proceed the same as in A1.

Note that implementing Newton's 3rd law essentially halves the computation cost in steps (1) and (2) at the expense of doubling the communication cost: there are now two communication steps, (3) and (5), each of which scales as N. This is only a net gain if the communication cost in A1 is less than a third of the overall run time. This is usually not the case on large numbers of processors, so in practice we almost always choose A1 instead of A2 for an AD algorithm. However, for small P or expensive force models, A2 can be faster.

6.4 Load-balance

Each processor will have an equal amount of work if each Fz or Gz block has roughly the same number of non-zero elements. This will be the case if the atom density is uniform across the simulation domain. However, non-uniform densities can arise if, for example, there are free surfaces so that some atoms border on vacuum, or phase changes are occurring within a liquid or solid. This is only a problem for load-balance if the N atoms are ordered in a geometric sense, as is typically the case. Then a group of N/P atoms near a surface, for example, will have fewer neighbors than groups in the interior. This can be overcome by randomly permuting the atom ordering at the beginning of the simulation, which is equivalent to permuting rows and columns of F or G. This ensures that every Fz or Gz will have roughly the same number of non-zeros. A random permutation also has the advantage that the load-balance will likely persist as atoms move about during the simulation. Note that this permutation only needs to be done once, as a pre-processing step before beginning the dynamics.
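
The permutation itself can be generated once at start-up, for instance with a Fisher-Yates shuffle (a standard technique, not prescribed here); every processor must use the same seed so that all agree on the new ordering.

    /* One-time random permutation of the atom ordering: atom k of the
       permuted ordering is atom perm[k] of the original, geometric one. */
    #include <stdlib.h>

    void permute_atoms(int N, int *perm, unsigned seed)
    {
        srand(seed);                       /* identical seed on every processor */
        for (int i = 0; i < N; i++) perm[i] = i;
        for (int i = N - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
        }
    }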

In summary, the AD algorithms divide the MD force computation and integration evenly across the processors (ignoring the O(N) component of binned neighbor list construction, which is usually not significant). However, the algorithms require global communication, as each processor must acquire information held by all the other processors. This communication scales as N, independently of P, so it limits the number of processors that can be used effectively. The chief advantage of these algorithms is their simplicity. Steps (1), (2) and (4) can be implemented by simply modifying the loops and data structures in a serial or vector code to treat N/P atoms instead of N. The expand and fold communication operations (3) and (5) can be treated as black-box routines (for example by using the MPI operations) and inserted at the proper locations in the code. Few other changes are typically necessary to parallelize an existing code.


7 Force-decomposition algorithm

Our next parallel MD algorithm is based on a block-decomposition of the force matrix rather than the row-wise decomposition used in the previous chapter. We call this a force-decomposition (FD) of the workload. As we shall see, this improves the O(N) scaling of the communication cost to O(N/√P). Block-decompositions of matrices are common in linear algebra algorithms for parallel machines, even if they are not common for short-range MD simulations. The assignment of sub-blocks of the force matrix to processors, with a row-wise (calendar) ordering of the processors, is depicted in Figure 7. We assume for ease of exposition that P is an even power of 2 and that N is a multiple of P, although again it is straightforward to relax these constraints. As before, we let z index the processors from 0 to P − 1; processor Pz owns and will update the N/P atoms stored in the sub-vector xz.

Figure 7: The division of the permuted force matrix F′ among 16 processors in the force-decomposition algorithm. Processor P6 is assigned a sub-block F′6 of size N/√P by N/√P. To compute its matrix elements it must know the corresponding N/√P-length pieces xα and x′β of the position vector x and of the permuted position vector x′.

To reduce communication (explained below), the block-decomposition in Figure 7 is actually of a permuted force matrix F′, which is formed by rearranging the columns of F in a particular way. If we order the xz pieces in row-order, they form the usual position vector x, which is shown as a vertical bar at the left of the figure. Were we to have x span the columns, as in Figure 3, we would form the force matrix as before. Instead, we span the columns with a permuted position vector x′, shown as a horizontal bar at the top of Figure 7, in which the xz pieces are stored in column-order. Thus, in the 16-processor example shown in the figure, x stores each processor's piece in the usual order (0, 1, 2, 3, 4, ..., 14, 15) while x′ stores them as (0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15). Now the (ij) element of F′ is the force on atom i of vector x due to atom j of the permuted vector x′.

The F′z sub-block owned by each processor Pz is of size (N/√P) × (N/√P). To compute the matrix elements in F′z, processor Pz must know one N/√P-length piece of each of the x and x′ vectors, which we denote as xα and x′β. As these elements are computed they will be accumulated into corresponding force sub-vectors fα and f′β. The Greek subscripts α and β each run from 0 to √P − 1 and reference the row and column position occupied by processor Pz. Thus, for processor 6 in the figure, xα consists of the x sub-vectors (4,5,6,7) and x′β consists of the x′ sub-vectors (2,6,10,14).
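
For a concrete picture of the indexing, the row α, the column β and the sub-vector indices making up xα and x′β follow from z as in the small sketch below (sqrtP stands for √P; the layout is that of Figure 7).

    /* Indexing sketch for the FD decomposition of Figure 7.  Processor z sits
       in row alpha and column beta of the sqrtP x sqrtP grid; x_alpha is made
       of the x sub-vectors of its row, x'_beta of the column-ordered pieces. */
    void fd_pieces(int z, int sqrtP, int *row_pieces, int *col_pieces)
    {
        int alpha = z / sqrtP;                 /* row    of processor Pz */
        int beta  = z % sqrtP;                 /* column of processor Pz */
        for (int k = 0; k < sqrtP; k++) {
            row_pieces[k] = alpha * sqrtP + k; /* e.g. (4,5,6,7)   for P6, P = 16 */
            col_pieces[k] = k * sqrtP + beta;  /* e.g. (2,6,10,14) for P6, P = 16 */
        }
    }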

7.1 The simpler algorithm

Our first FD algorithm F1 is outlined in Figure 8. As before, each processor has updated copies of the needed atom positions xα and x′β at the beginning of the time step. In step (1) neighbor lists are constructed. Again, for small problems this is most quickly done by checking all N²/P possible pairs in F′z. For large problems, the N/√P atoms in x′β are binned, then the 27 surrounding bins of each atom in xα are checked. The checking procedure scales as N/√P, so the scaling of the binned neighbor list construction is still N/√P. In step (2) the neighbor lists are used to compute the matrix elements in F′z. As before, the elements are summed into a local copy of fα as they are computed, so F′z never needs to be stored in matrix form. In step (3) a fold operation is performed within each row of processors, so that processor Pz obtains the total forces fz on its N/P atoms. Although the fold algorithm used is the same as in the preceding chapter, there is a key difference: in this case the vector fα being folded is only of length N/√P and only the √P processors in one row participate in the fold. Thus this operation scales as N/√P instead of N as in the AD algorithm.

In step (4), fz is used by Pz to update the N/P atom positions in xz. Steps (5a-5b) share these updated positions with all the processors that will need them for the next time step. These are the processors which share a row or column with Pz. First, in (5a), the processors in row α perform an expand of their xz sub-vectors so that each acquires the entire xα. As with the fold, this operation scales as the N/√P length of xα, instead of as N as it did in algorithms A1 and A2. Similarly, in step (5b), the processors in each column β perform an expand of their xz. As a result they all acquire x′β and are ready to begin the next time step.

It is in step (5) that using a permuted force matrix F′ saves extra communication. The permuted form of F′ causes xz to be a component of both xα and x′β for each Pz. This would not be the case if we had block-decomposed the original force matrix F by having x span the columns instead of x′. Then in Figure 7 the xβ for P6 would have consisted of the sub-vectors (8,9,10,11), none of whose components are known by P6. Thus, before performing the expand in step (5b), processor 6 would first need to acquire one of these 4 components from another processor (in the transpose position in the matrix), requiring an extra O(N/P) communication step. The transpose-free version of the FD algorithms presented here was motivated by a matrix permutation for parallel matrix-vector multiplication discussed in reference [22].


(1) Construct neighbor lists of non-zero interactions in F′z
      (S < 4Rs) All pairs .................................... N²/P
      (S ≥ 4Rs) Binning ...................................... N/√P
(2) Compute elements of F′z, storing results in fα ............ N/P
(3) Fold fα within row α, result is fz ........................ N/√P
(4) Update atom positions in xz using fz ...................... N/P
(5a) Expand xz within row α, result in xα ..................... N/√P
(5b) Expand xz within column β, result in x′β ................. N/√P

Figure 8: Single time step of force-decomposition algorithm F1 for processor Pz.
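
The row and column expands and folds of F1 map naturally onto MPI sub-communicators. One possible way to set them up, assuming P is a perfect square and the ranks are laid out row-wise as in Figure 7, is sketched below; step (5a), for instance, then becomes an allgather on row_comm.

    /* Sketch: build row and column sub-communicators for the FD algorithms. */
    #include <math.h>
    #include <mpi.h>

    void fd_make_comms(MPI_Comm world, MPI_Comm *row_comm, MPI_Comm *col_comm)
    {
        int z, P;
        MPI_Comm_rank(world, &z);
        MPI_Comm_size(world, &P);
        int sqrtP = (int)(sqrt((double)P) + 0.5);

        int alpha = z / sqrtP;                          /* my row    */
        int beta  = z % sqrtP;                          /* my column */

        /* ranks with the same color end up in the same sub-communicator */
        MPI_Comm_split(world, alpha, beta,  row_comm);  /* row alpha   */
        MPI_Comm_split(world, beta,  alpha, col_comm);  /* column beta */
    }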

7.2 Taking advantage of Newton's 3rd law

As with algorithm A1, algorithm F1 does not take advantage of Newton's 3rd law: each pairwise force interaction is computed twice. Algorithm F2 avoids this duplicated effort by checkerboarding the force matrix as in the preceding chapter. Specifically, the checkerboarded matrix G is permuted in the same way as F was, to form G′. Note that now the total force on atom i is the sum of all matrix elements in row i minus the sum of all elements in column i. The modified FD algorithm F2 is outlined in Figure 9. Step (1) is the same as in F1 except that half as many interactions are stored in the neighbor lists. Likewise, step (2) requires only half as many matrix elements to be computed. For each (ij) element, the computed force components are now summed into two force sub-vectors instead of one: the force on atom i is summed into fα in the location corresponding to row i, and likewise the force on atom j is summed into f′β in the location corresponding to column j. Steps (3a-3c) accumulate these forces so that processor Pz ends up with the total force on its N/P atoms. First, in step (3a), the √P processors in column β fold their local copies of f′β. The result is f′z; each element of this N/P-length sub-vector is the sum of an entire column of G′. Next, in step (3b), the row contributions to the forces are summed by performing a fold of the fα vector within each row α. The result is fz, each element of which is the sum across a row of G′. Finally, in step (3c) the column and row contributions are subtracted, element by element, to yield the total forces fz on the atoms owned by processor Pz. The processor can now update the positions and velocities of its atoms; steps (4) and (5) are identical to those of F1.

In the FD algorithms, exploiting Newton's 3rd law again halves the computation required in steps (1) and (2). However, the communication cost in steps (3) and (5) does not double: there are 4 expands and folds required in F2 versus 3 in F1. Thus, in practice, it is usually faster to use algorithm F2, with its reduced computational cost and slightly increased communication cost, rather than F1. The key point is that all the expand and fold operations in F1 and F2 scale as N/√P rather than as N, as was the case in algorithms A1 and A2. When one runs on large numbers of processors this significantly reduces the time the FD algorithms spend on communication as compared to the AD algorithms.


(1) Construct neighbor lists of non-zero interactions in G′z
      (S < 4Rs) All pairs .................................... N²/2P
      (S ≥ 4Rs) Binning ...................................... N/2√P
(2) Compute elements of G′z, storing results in fα and f′β .... N/2P
(3a) Fold f′β within column β, result is f′z .................. N/√P
(3b) Fold fα within row α, result is fz ....................... N/√P
(3c) Subtract f′z from fz, result is total fz ................. N/P
(4) Update atom positions in xz using fz ...................... N/P
(5a) Expand xz within row α, result in xα ..................... N/√P
(5b) Expand xz within column β, result in x′β ................. N/√P

Figure 9: Single time step of force-decomposition algorithm F2 for processor Pz, which takes advantage of Newton's third law.


7.3 Load-balance

Finally, the issue of load-balance is a more serious concern for the FD algorithms. Processors will have equal work to do only if all the matrix blocks F′z or G′z are uniformly sparse. If the atoms are ordered geometrically this will not be the case, even for problems with uniform density, because such an ordering creates a force matrix with diagonal bands of non-zero elements. As in the AD case, a random permutation of the atom ordering produces the desired effect. Only now the permutation should be done as a pre-processing step for all problems, even those with uniform atom densities.

In summary, algorithms F1 and F2 divide the MD computations evenly across processors, as did the AD algorithms. But the block-decomposition of the force matrix means each processor only needs O(N/√P) information to perform its computations. Thus the communication and memory costs are reduced by a factor of √P versus algorithms A1 and A2. The FD strategy retains the simplicity of the AD technique; F1 and F2 can be implemented using the same "black-box" communication routines as A1 and A2. The FD algorithms also need no geometric information about the physical problem being modeled to perform optimally. In fact, for load-balancing purposes the algorithms intentionally ignore such information by using a random atom ordering.


8 Spatial-decomposition algorithm

In our final parallel algorithm the physical simulation domain is subdivided into small 3-D boxes, one for each processor. We call this a spatial-decomposition (SD) of the workload. Each processor computes forces on and updates the positions and velocities of all atoms within its box at each time step. Atoms are reassigned to new processors as they move through the physical domain. In order to compute forces on its atoms, a processor needs to know only the positions of atoms in nearby boxes. The communication required in the SD algorithm is thus local in nature, as compared to global in the AD and FD cases.

The size and shape of the box assigned to each processor depend on N, P, and the aspect ratio of the physical domain, which we assume to be a 3-D rectangular parallelepiped. Within these constraints, the number of processors in each dimension is chosen so as to make each processor's box as "cubic" as possible. This is to minimize communication, since in the large-N limit the communication cost of the SD algorithm turns out to be proportional to the surface area of the boxes. An important point to note is that, in contrast to the link-cell method described in Section 4.3, the box lengths may now be smaller or larger than the force cutoff lengths Rc and Rs.

Each processor in our SD algorithm maintains two data structures, one for the N/P atoms in its box and one for atoms in nearby boxes. In the first data structure, the processor stores complete information (positions, velocities, neighbor lists, etc.). This data is stored in a linked list to allow insertions and deletions as atoms move to new boxes. In the second data structure only atom positions are stored. Inter-processor communication at each time step keeps this information current.

8.1 The communication scheme

The communication scheme we use to acquire this information from the processors owning the nearby boxes is shown in Figure 10. The first step (a) is for each processor to exchange information with adjacent processors in the east/west dimension. Processor 2 fills a message buffer with the atom positions it owns that are within a force cutoff length Rs of processor 1's box.10 If s < Rs, where s is the box length in the east/west direction, this will be all of processor 2's atoms; otherwise it will be those nearest to box 1. Now each processor sends its message to the processor in the westward direction (2 sends to 1) and receives a message from the eastward direction. Each processor puts the received information into its second data structure. Then the procedure is reversed, with each processor sending to the east and receiving from the west. If s > Rs, all needed atom positions in the east-west dimension have now been acquired by each processor. If s < Rs, the east-west steps are repeated, with each processor sending more needed atom positions to its adjacent processors. For example, processor 2 sends processor 1 the atom positions from box 3 (which processor 2 now has in its second data structure). This can be repeated until each processor knows all atom positions within a distance Rs of its box, as indicated by the dotted boxes in the figure. The same procedure is then repeated in the north/south dimension (see step

10 The reason for using Rs instead of Rc will be made clear further on in this chapter.


(b) of Figure 10). The only difference is that the messages sent to the adjacent processor now contain not only atoms the processor owns (in its first data structure), but also any atom positions in its second data structure that are needed by the adjacent processor. For s = Rs, this has the effect of sending 3 boxes' worth of atom positions in one message, as shown in (b). Finally, in step (c) the process is repeated in the up/down dimension. Now atom positions from an entire plane of boxes (9 in the figure) are sent in each message.

Figure 10: Method by which a processor acquires nearby atom positions in the spatial-decomposition algorithm. In six data exchanges all atom positions in adjacent boxes in the (a) east/west, (b) north/south, and (c) up/down directions can be communicated.
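
With MPI, the six exchanges map naturally onto a 3-D Cartesian communicator built once with mpi_cart_create. The sketch below shows the east/west pair of exchanges for the simple case s ≥ Rs, where a single exchange per direction suffices; buffer packing, the north/south and up/down steps, and the repeated exchanges needed when s < Rs are omitted, and all names are illustrative.

    /* Sketch of the east/west exchanges of the SD communication scheme for
       s >= Rs.  The send buffers hold the atom positions selected for each
       neighbor (packing not shown); counts are numbers of doubles. */
    #include <mpi.h>

    void exchange_east_west(const double *sendw, int nsendw, /* for west neighbor */
                            const double *sende, int nsende, /* for east neighbor */
                            double *recv, int maxrecv,
                            MPI_Comm cart)                   /* 3-D Cartesian comm */
    {
        int west, east, nrecv;
        MPI_Status st;

        /* neighbors along dimension 0, found from the periodic Cartesian grid
           created with MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart) */
        MPI_Cart_shift(cart, 0, 1, &west, &east);

        /* (a) send to the west, receive from the east ... */
        MPI_Sendrecv(sendw, nsendw, MPI_DOUBLE, west, 0,
                     recv, maxrecv, MPI_DOUBLE, east, 0, cart, &st);
        MPI_Get_count(&st, MPI_DOUBLE, &nrecv);
        /* ... append the nrecv received values to the second data structure ... */

        /* (b) then the reverse: send to the east, receive from the west */
        MPI_Sendrecv(sende, nsende, MPI_DOUBLE, east, 0,
                     recv, maxrecv, MPI_DOUBLE, west, 0, cart, &st);
        MPI_Get_count(&st, MPI_DOUBLE, &nrecv);
    }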

There are several key advantages to this scheme, all of which reduce the overall cost of communication in our algorithm.

1. First, for s ≥ Rs, the needed atom positions from all 26 surrounding boxes are obtained in just 6 data exchanges. Moreover, if the parallel machine has a hypercube or 3-D torus topology (as for example the IBM BlueGene/P supercomputer at JSC), the processors can be mapped to the boxes in such a way that all 6 of these processors are directly connected to the center processor. Thus message passing will be fast and contention-free.

2. Second, when s < Rs, so that atom information is needed from more distant boxes, this occurs with only a few extra data exchanges, all of which are still with the 6 immediate neighbor processors. This is an important feature of the algorithm that enables it to perform well even when large numbers of processors are used on relatively small problems.

3. A third advantage is that the amount of data communicated is minimized. Each processor acquires only the atom positions that are within a distance Rs of its box.

4. Fourth, all of the received atom positions can be placed as contiguous data directly into the processor's second data structure. No time is spent rearranging data, except to create the buffered messages that need to be sent. Finally, as will be discussed in more detail below, this message creation can be done very quickly. A full scan of the two data structures is only done once every few time steps, when the neighbor lists are created, to decide which atom positions to send in each message. The scan procedure creates a list of atoms that make up each message. During all the other time steps, these lists can be used, in lieu of scanning the full atom list, to directly index the referenced atoms and buffer up the messages quickly. This is the equivalent of a gather operation on a vector machine.

8.2 The simpler algorithm

We now outline our SD algorithm S1 in Figure 11. Box z is assigned to processor Pz, where z runs from 0 to P − 1 as before. Processor Pz stores the atom positions of its N/P atoms in xz and the forces on those atoms in fz. Steps (1a-1c) are the neighbor list construction, performed once every few time steps. This is somewhat more complex than in the other algorithms because, as discussed above, it includes the creation of the lists of atoms that will be communicated at every time step. First, in step (1a) the positions, velocities, and any other identifying information of atoms that are no longer inside box z are deleted from xz (first data structure) and stored in a message buffer. These atoms are exchanged with the 6 adjacent processors via the communication pattern of Figure 10. As the information routes through each dimension, processor Pz checks for new atoms that are now inside its box boundaries, adding them to its xz. Next, in step (1b) all atom positions within a distance Rs of box z are acquired by the communication scheme described above. As the different messages are buffered by scanning through the two data structures, lists of the included atoms are made. These lists will be used in step (5). The scaling factor ∆ for steps (1a) and (1b) will be explained below.

When steps (1a) and (1b) are complete, both of the processor's data structures are current. Neighbor lists for its N/P atoms can now be constructed in step (1c). If atoms i and j are both in box z (an inner-box interaction), the (ij) pair is only stored once in the neighbor list. If i and j are in different boxes (a two-box interaction), both processors store the interaction in their respective neighbor lists. If this were not done, processors would compute forces on atoms they do not own, and communication of the forces back to the processors owning the atoms would be required. A modified algorithm that performs this communication to avoid the duplicated force computation of two-box interactions is discussed below. When s, the length of box z, is less than two cutoff distances, it is quicker to find neighbor interactions by checking each atom inside box z against all the atoms in both of the processor's data structures; this scales roughly as the square of N/P. If s > 2Rs, then with the shell of atoms around box z there are 4 or more bins in each dimension. In this case, as with the algorithms of the preceding sections, it is quicker to perform the neighbor list construction by binning: all the atoms in both data structures are mapped to bins of size Rs, and the surrounding bins of each atom in box z are then checked for possible neighbors.

Processor Pz can now compute all the forces on its atoms in step (2) using the neighbor lists.


(1a) Move necessary atoms to new boxes    ∆
(1b) Make lists of all atoms that will need to be exchanged    ∆
(1c) Construct neighbor lists of interacting pairs in box z
        (s < 2Rs) All pairs    (N/P)(N/2P + ∆)
        (s ≥ 2Rs) Binning    N/2P + ∆
(2) Compute forces on atoms in box z, doubly storing results in fz    N/2P + ∆
(4) Update atom positions in xz in box z using fz    N/P
(5) Exchange atom positions across box boundaries with neighboring processors    (N/P)(1 + 2Rs/s)^3
        (s < Rs) Send N/P positions to many neighbors    Rs^3
        (s ≳ Rs) Send N/P positions to nearest neighbors    N/P
        (s ≫ Rs) Send positions near box surface to nearest neighbors    (N/P)^(2/3)

Figure 11: Single time step of spatial-decomposition algorithm S1 for processor Pz, with the scaling of each part listed at the right.

When the interaction is between two atoms inside box z, the resulting force is stored twice in fz, once for atom i and once for atom j. For two-box interactions, only the force on the processor's own atom is stored. After computing fz, the atom positions are updated in step (4). Finally, these updated positions must be communicated to the surrounding processors in preparation for the next time step. This occurs in step (5), using the communication pattern of Figure 10 and the previously created lists. The amount of data exchanged in this operation is a function of the relative values of the force cutoff distance and box length, and is discussed in the next subsection. Also, we note that on the time steps when neighbor lists are constructed, step (5) does not have to be performed, since step (1b) has the same effect.
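The double-storage rule for step (2) can be sketched as follows in Python, with a reduced-unit Lennard-Jones pair force standing in for the force model; compute_forces, allatoms, and n_own are illustrative names, and the cutoff test is omitted for brevity.

```python
import numpy as np

def lj_force(rij):
    """Lennard-Jones force on atom i from atom j in reduced units
    (epsilon = sigma = 1), with rij = x_i - x_j."""
    r2 = np.dot(rij, rij)
    inv_r6 = 1.0 / r2 ** 3
    return 24.0 * inv_r6 * (2.0 * inv_r6 - 1.0) / r2 * rij

def compute_forces(allatoms, pairs, n_own):
    """Sketch of step (2): fz holds forces only for the n_own atoms of box z.
    Inner-box pairs (j < n_own) are stored once in the neighbor list, so the
    reaction force is applied here as well; for two-box pairs only the force
    on the processor's own atom i is kept."""
    fz = np.zeros((n_own, 3))
    for i, j in pairs:
        f = lj_force(allatoms[i] - allatoms[j])
        fz[i] += f
        if j < n_own:              # inner-box interaction: double storage
            fz[j] -= f
    return fz
```

Feeding the pairs from the earlier neighbor-list sketch into compute_forces(np.vstack([own, ghost]), pairs, len(own)) would produce the fz array for box z.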

8.2.1 Communication scaling

The communication operations in algorithm S1 occur in steps (1a), (1b) and (5). The communication in the latter two steps is identical. The cost of these steps scales as the volume of data exchanged. For step (5), if we assume uniform atom density, this is proportional to the physical volume of the shell of thickness Rs around box z, namely (s + 2Rs)^3 − s^3. Note that there are roughly N/P atoms in a volume of s^3, since s^3 is the size of box z. There are three cases to consider:

1. If s < Rs, data from many neighboring boxes must be exchanged and the operation scales as 8Rs^3.

2. If s ≳ Rs, the data in all 26 surrounding boxes is exchanged and the operation scales as 27N/P.

3. If s ≫ Rs, only atom positions near the 6 faces of box z will be exchanged. The communication then scales as the surface area of box z, namely 6Rs(N/P)^(2/3).

These cases are explicitly listed in the scaling of step (5). Elsewhere in Figure 11, we use the term ∆ to represent whichever of the three is applicable for a given N, P, and Rs. We note that step (1a) involves less communication, since not all the atoms within a cutoff distance of a box face will move out of the box. But this operation still scales as the surface area of box z, so we list its scaling as ∆.
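A short numerical sketch (Python, with arbitrary values of s and a fixed Rs = 2) shows how the shell volume (s + 2Rs)^3 − s^3 passes through the three regimes above and which limiting term dominates in each case.

```python
def exchange_volume(s, rs):
    """Volume of the shell of thickness Rs around a box of side s; at uniform
    atom density it is proportional to the data exchanged in step (5)."""
    return (s + 2.0 * rs) ** 3 - s ** 3

# Made-up box sizes spanning the three regimes, compared with the limiting terms.
rs = 2.0
for s in (0.5, 2.0, 50.0):
    print(f"s = {s:5.1f}:  shell = {exchange_volume(s, rs):9.1f}   "
          f"8Rs^3 = {8 * rs**3:6.1f}   26 s^3 = {26 * s**3:9.1f}   "
          f"6 Rs s^2 = {6 * rs * s**2:9.1f}")
```

For the largest box the surface term 6 Rs s^2 carries almost all of the shell volume, which is the (N/P)^(2/3) behaviour quoted for step (5).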

8.2.2 Computational scaling

The computational portion of algorithm S1 is in steps (1c), (2) and (4). All of these scale as N/P, with additional work in steps (1c) and (2) for the atoms that are neighboring box z and stored in the second data structure. The number of these atoms is proportional to ∆, so it is included in the scaling of those steps. The leading term in the scaling of steps (1c) and (2) is listed as N/2P, as in algorithms A2 and F2, since inner-box interactions are only stored and computed once for each pair of atoms in algorithm S1. Note that as s grows large relative to Rs, as it will for very large simulations, the ∆ contribution to the overall computation time decreases and the overall scaling of algorithm S1 approaches the optimal N/2P. In essence, each processor spends nearly all its time working in its own box and only exchanges a relatively small amount of information with neighboring processors to update its boundary conditions.

8.2.3 Data structures’ update

An important feature of algorithm S1 is that the data structures are only modified once every few time steps, when neighbor lists are constructed. In particular, even if an atom moves outside box z's boundaries, it is not reassigned to a new processor until step (1a) is executed. Processor Pz can still compute correct forces for the atom so long as two criteria are met:

• The first is that an atom does not move farther than s between two neighbor list constructions.

• The second is that all nearby atoms within a distance Rs, instead of Rc, must be updated every time step.

The alternative is to move atoms to their new processors at every time step. This has the advantage that only atoms within a distance Rc of box z need to be exchanged at the time steps when neighbor lists are not constructed. This reduces the volume of communication, since Rc < Rs. However, now the neighbor list of a reassigned atom must also be sent. The information in the neighbor list is atom indices referencing the local memory locations where the neighbor atoms are stored. If atoms are continuously moving to new processors, these local indices become meaningless. To overcome this, one can assign a global index (1 to N) to each atom that moves with the atom from processor to processor. A mapping of global index to local memory must then be stored in a vector of size N by each processor, or the global indices must be sorted and searched to find the correct atoms when they are referenced in a neighbor list. The former solution limits the size of problems that can be run; the latter incurs an extra cost for the sort and search operations.
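The two bookkeeping options can be sketched in a few lines of Python; the toy values (N = 1000, the contents of local_globals) are invented purely for illustration.

```python
import numpy as np

# Hypothetical data: global IDs of the atoms currently stored on this
# processor (own atoms plus ghosts), in the order they sit in local memory.
local_globals = np.array([7, 102, 3, 55, 998, 14])
N = 1000                                    # total number of atoms in the system

# Option 1: a dense vector of size N per processor (fast lookup, memory-heavy).
dense_map = np.full(N, -1, dtype=np.int64)
dense_map[local_globals] = np.arange(len(local_globals))

# Option 2: sort the stored global IDs and binary-search on each lookup
# (memory-light, extra cost for the sort and the searches).
order = np.argsort(local_globals)
sorted_globals = local_globals[order]

def lookup(global_id):
    k = np.searchsorted(sorted_globals, global_id)
    if k < len(sorted_globals) and sorted_globals[k] == global_id:
        return order[k]                     # local memory index
    return -1                               # not stored on this processor

assert dense_map[55] == lookup(55) == 3
```

The dense vector costs O(N) memory on every processor regardless of how few atoms it actually stores, while the sorted variant pays for the sort and a binary search per neighbor-list reference, which is exactly the trade-off noted above.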

By contrast, the S1 version of the code exploits an idea of Tamayo and Giles [23] that makes it less complex and reduces the computational and communication overhead. This does not affect the timings for simulations with large N, but it improves the algorithm's performance for medium-sized problems.

8.3 Taking advantage of Newton's 3rd law

A modified version of S1 that takes full advantage of Newton's 3rd law can also be devised; call it algorithm S2. If processor Pz acquires atoms only from its west, south, and down directions (and sends its own atoms only in the east, north, and up directions), then each pairwise interaction need only be computed once, even when the two atoms reside in different boxes. This requires sending computed force results back in the opposite directions to the processors that own the atoms, as a step (3) in the algorithm. This scheme does not reduce communication costs, since half as much information is communicated twice as often, but it does eliminate the duplicated force computations for two-box interactions. Two points are worth noting. First, the overall savings of S2 over S1 is small, particularly for large N; only the ∆ term in steps (1c) and (2) is saved. Second, the performance of SD algorithms for large systems can be improved by optimizing the single-processor force computation in step (2). As with vector machines, this requires that more attention be paid to data structures and loop orderings in the force and neighbor-list construction routines to achieve high single-processor flop rates. Implementing S2 requires special-case coding for atoms near box edges and corners to ensure all interactions are counted only once [24], which can hinder this optimization process. For all these reasons we will not implement an S2 code.

8.4 Load-balance

Finally, the issue of load-balance is an important concern in any SD algorithm. Algorithm S1 will be load-balanced only if all boxes have a roughly equal number of atoms (and surrounding atoms). This will not be the case if the physical atom density is non-uniform. Additionally, if the physical domain is not a rectangular parallelepiped, it can be difficult to split it into P equal-sized pieces. Sophisticated load-balancing algorithms have been developed to partition an irregular physical domain or non-uniformly dense clusters of atoms, but they create sub-domains which are irregular in shape or are connected in an irregular fashion to their neighboring sub-domains. In either case, the task of assigning atoms to sub-domains and communicating with neighbors becomes more costly and complex. If the physical atom density changes over time during the MD simulation, the load-balance problem is compounded. Any dynamic load-balancing scheme requires additional computational overhead and data movement.

In summary, the SD algorithm, like the AD and FD ones, evenly divides the MD computations across all the processors. Its chief benefit is that it takes full advantage of the local nature of the interatomic forces by performing only local communication. Thus, in the large N limit, it achieves optimal O(N/P) scaling and is clearly the fastest algorithm. However, this is only true if good load-balance is also achievable. Since its performance is sensitive to the problem geometry, algorithm S1 is more restrictive than A2 and F2, whose performance is geometry-independent. A second drawback of algorithm S1 is its complexity; it is more difficult to implement efficiently than the simpler AD and FD algorithms. In particular, the communication scheme requires extra coding and bookkeeping to create messages and access data received from neighboring boxes. In practice, integrating algorithm S1 into an existing serial MD code can require a substantial reworking of data structures and code.


References

[1] B. J. Alder, T. E. Wainwright, J. Chem. Phys. 27, p. 1208 (1957).

[2] L. Verlet, Phys. Rev. 159, p. 98 (1967).

[3] L. Hannon, G. C. Lee, E. Clementi, J. Sci. Computing 1(2), p. 145 (1986).

[4] M. Karplus, J. A. McCammon, Scientific American April 1986, p. 42 (1986).

[5] See, for example, J. P. Hansen and I. R. McDonald, Theory of simple liquids, 2nd Ed., Academic, 1986. A classic book on liquids, with a chapter devoted to computer simulation.

[6] See for instance N. W. Ashcroft, N. D. Mermin, Solid state physics, Brooks Cole, 1976.

[7] M. P. Allen and D. J. Tildesley, Computer simulation of liquids, Oxford, 1987. This is a classic "how to" book containing a large number of technical details on a vast array of techniques.
J. M. Haile, Molecular dynamics simulation, Wiley, 1992. An excellent general introduction to molecular dynamics. Not as thorough as the previous one in explaining techniques, but very deep in explaining the fundamentals, including "philosophical" aspects. Probably the best starting point. A good amount of Fortran 77 code is included.

[8] H. J. C. Berendsen and W. F. van Gunsteren, in Molecular dynamics simulation ofstatistical-mechanical systems, G. Ciccotti and W. G. Hoover (Eds.), North-Holland, 1986.

[9] D. Frenkel and B. Smit, Understanding Molecular Simulation, 2nd ed., Academic, SanDiego, 2002.

[10] R. W. Hockney and J. W. Eastwood, Computer Simulation Using Particles, Adam Hilger, New York, 1988.

[11] J. E. Barnes and P. Hut, Nature 324, p. 446 (1986).

[12] L. Greengard and V. Rokhlin, J. Comp. Phys. 73, p. 325 (1987).

[13] R. W. Hockney, S. P. Goel and J. W. Eastwood, J. Comp. Phys. 14, p. 148 (1974).

[14] http://www.charmm.org

[15] http://www.igc.ethz.ch/GROMOS/index

[16] http://ambermd.org

[17] http://www.gromacs.org

[18] http://www.ks.uiuc.edu/Research/namd

[19] E. A. Mastny, J. J. de Pablo, J. Chem. Phys. 127, p. 104504 (2007).


[20] G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker, Solving Problems on Concurrent Processors: Volume 1, Prentice Hall, Englewood Cliffs, NJ (1988).

[21] H. Schreiber, O. Steinhauser and P. Schuster, Parallel Computing 18, p. 557-573 (1992).

[22] J. G. Lewis and R. A. van de Geijn, Distributed memory matrix-vector multiplication and conjugate gradient algorithms. In Proc. Supercomputing '93, p. 484-492, IEEE Computer Society Press (1993).

[23] P. Tamayo and R. Giles, A parallel scalable approach to short-range molecular dynamics on the CM-5. In Proc. Scalable High Performance Computing Conference-92, p. 240-245, IEEE Computer Society Press (1992).

[24] D. Brown, J. H. R. Clarke, M. Okuda and T. Yamazaki, A domain decomposition parallelization strategy for molecular dynamics simulations on distributed memory machines. Comp. Phys. Comm. 73, p. 67-80 (1993).
