Molecular models, threads and you

Jiahao Chen

Martínez Group
Dept. Chemistry, CATMS, MRL and Beckman

CS 498 MG presentation: 2007-12-07

Optimizing the TINKER classical molecular dynamics code while maintaining code readability

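Molecular models/force fields

Typical energy function

E = (covalent bond effects) + (noncovalent interactions)

• Covalent bond effects: bond stretch, angle torsion, dihedrals

• Noncovalent interactions: electrostatics, dispersion

\[
E = \sum_{b \in \text{bonds}} k_b \left(r_b - r_{\text{eq},b}\right)^2
  + \sum_{a \in \text{angles}} \kappa_a \left(\theta_a - \theta_{\text{eq},a}\right)^2
  + \sum_{d \in \text{dihedrals}} \sum_n l_{nd} \cos(n\phi_d)
  + \sum_{i<j \in \text{atoms}} \frac{q_i q_j}{r_{ij}}
  + \sum_{i<j \in \text{atoms}} \epsilon_{ij} \left[ \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6} \right]
\]

Noncovalent pair sums: computation cost = O(N^2)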
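The O(N^2) cost comes from the two sums over atom pairs i < j. A minimal Python sketch of those pair sums follows; it is an illustration only, not TINKER's Fortran. The function name `nonbonded_energy` is hypothetical, and a single `epsilon` and `sigma` for every pair is a simplifying assumption.

```python
import math

def nonbonded_energy(positions, charges, epsilon, sigma):
    """Sum the electrostatic and dispersion terms over all pairs i < j.

    Assumption for illustration: one epsilon and one sigma shared by
    every pair, instead of per-pair parameters.
    """
    n = len(positions)
    e_coulomb = 0.0
    e_dispersion = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):          # i < j: each pair visited once
            r = math.dist(positions[i], positions[j])
            e_coulomb += charges[i] * charges[j] / r
            s6 = (sigma / r) ** 6
            e_dispersion += epsilon * (s6 * s6 - s6)
            pairs += 1
    return e_coulomb, e_dispersion, pairs
```

The loop visits N(N-1)/2 pairs, so six atoms give 15 pair evaluations and doubling N roughly quadruples the work.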

Problem description

• The state of the system is given by the position x_i and momentum p_i of every atom i (of mass m_i):

\[ (x_1, p_1, x_2, p_2, \cdots, x_N, p_N) \in \mathbb{R}^{3 \times 2 \times N} \]

• Solve the system of coupled ordinary differential equations (Hamilton's equations)

\[ \frac{dx_i}{dt} = \frac{p_i}{m_i}, \qquad \frac{dp_i}{dt} = -\frac{\partial E}{\partial x_i}, \qquad i = 1, \cdots, N \]

• with user-specified initial conditions (e.g. with constant temperature and pressure)

• Subject to (user-specified) constraints, e.g. fixed bond angles
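One common way to integrate these equations of motion is a velocity Verlet step: half a momentum update, a full position update, a new force evaluation, then the second half momentum update. The sketch below assumes that scheme for illustration; the slides do not name the integrator, and the harmonic test force and all function names are hypothetical. It is written in Python rather than TINKER's Fortran.

```python
def velocity_verlet_step(x, p, m, force, dt):
    """Advance positions x and momenta p by one time step dt."""
    f = force(x)                                   # force = -dE/dx
    p_half = [pi + 0.5 * dt * fi for pi, fi in zip(p, f)]
    x_new = [xi + dt * ph / mi for xi, ph, mi in zip(x, p_half, m)]
    f_new = force(x_new)                           # forces at new positions
    p_new = [ph + 0.5 * dt * fi for ph, fi in zip(p_half, f_new)]
    return x_new, p_new

def harmonic_force(x, k=1.0):
    """Toy force for E = k x^2 / 2 (stand-in for a real force field)."""
    return [-k * xi for xi in x]

def total_energy(x, p, m, k=1.0):
    kinetic = sum(pi * pi / (2.0 * mi) for pi, mi in zip(p, m))
    potential = sum(0.5 * k * xi * xi for xi in x)
    return kinetic + potential

def run(n_steps, dt):
    """Integrate one particle starting at x = 1, p = 0 with unit mass."""
    x, p, m = [1.0], [0.0], [1.0]
    for _ in range(n_steps):
        x, p = velocity_verlet_step(x, p, m, harmonic_force, dt)
    return x, p, m
```

For this toy system the total energy should stay close to its initial value of 0.5 over many steps, which is a quick sanity check on any integrator of this kind. A production code would reuse the forces from the end of one step at the start of the next instead of recomputing them.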

Many parallel and serial implementations

Package name Threads MPI GlobalArrays

NAMD CHARM++
GROMACS ✓ ✓
TINKER
AMBER partly ✓ ✓
CHARMM ✓
LAMMPS ✓
NWChem ✓ ✓

Things I tried

• Compiler flags optimization

• Cache miss reduction

• Lookup tables

• Parallelization with OpenMP

Compiler flag optimization

flags        gfortran 4.1.2                  ifort 10.0.023
             time         speedup            time         speedup
-O0          29.95(2) s   -                  36.30(2) s   -
-Os          29.92(3) s   +0.77(3) %         32.59(4) s   +10.22(2) %
-O1          30.22(1) s   -0.90(4) %         32.12(3) s   +11.51(1) %
-O2          29.66(3) s   +0.96(1) %         30.30(2) s   +16.54(2) %
-O3          29.84(2) s   +0.38(2) %         30.83(2) s   +15.06(2) %
CE search    28.77(2) s   +3.62(3) % [1]     28.96(2) s   +20.22(1) % [2]

[1] FFLAGS = "-falign-functions -falign-jumps -falign-labels -falign-loops -fvpt -fcse-skip-blocks -fdelete-null-pointer-checks -ffast-math -fforce-addr -fgcse -fgcse-lm -fgcse-sm -floop-optimize -fkeep-static-consts -fmerge-constants -fno-defer-pop -fno-guess-branch-probability -fno-math-errno -funsafe-math-optimizations -fno-trapping-math -foptimize-register-move -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop -fno-sched-spec -fsched-spec-load -fsched-stalled-insns -fsignaling-nans -fsingle-precision-constant -fstrength-reduce -fthread-jumps -funroll-all-loops"

[2] FFLAGS = "-xN -no-prec-div -static -inline-level=1 -ip -fno-alias -fno-fnalias -fno-omit-frame-pointer -fkeep-static-consts -nolib-inline -heap-arrays 1 -pad -O3 -scalar-rep -funroll-loops -complex-limited-range"

Algorithm and time profile

[Flowchart with time profile; N = 6, gfortran 4.1.2]

Initialize model and parameters
for each time step
    Move one time step
        Update state by Δt/2
        Calculate potential energy and forces
            Calculate charge interactions        O(N^2)
            Calculate dispersion interactions    O(N^2)
            Calculate bond interactions          O(N)
            Calculate angle interactions         O(N)
            Calculate dihedral interactions      O(N)
            ...
            Add up all components
        Update state by Δt/2
    Enforce temp. & pressure
    Remove unphysical motions
    Calculate & record kinetic energy and properties
    Flush I/O
End

Time shares marked on the chart: >98%, >59%, <31%; 37%, 12%, 8%, 9%, 26%


An unexpected cost

[Figure: the flowchart and time profile from "Algorithm and time profile", repeated; N = 6]

Q: Why is 15% of total execution time spent adding numbers!?

A: many L2 cache misses

c     zero out each of the first derivative components
      do i = 1, n                           ! 7 misses
         do j = 1, 3
            deb(j,i) = 0.0d0                ! 42 misses
            ...                             ! 22 other terms
         end do
      end do
      ...
c     sum up to get the total energy and first derivatives
      energy = eb + ...
      do i = 1, n
         do j = 1, 3
            desum(j,i) = deb(j,i) + ...     ! 19 misses; 22 other terms
            derivs(j,i) = desum(j,i)        ! 2 misses
         end do
      end do

70 of 91 cache misses per time step (n = 6) shown

A simple solution

c     zero out each of the first derivative components
      do i = 1, n                           ! 7 misses
         do j = 1, 3
            deb(j,i) = 0.0d0                ! 26 misses (was 42)
            ...
         end do
      end do
      ...
c     sum up to get the total energy and first derivatives
      energy = eb + ...
      do i = 1, n
         do j = 1, 3
            temp = deb(j,i) + ...           ! 6 misses
            desum(j,i) = temp               ! 1 miss (was 19)
            derivs(j,i) = temp              ! 1 miss (was 2)
         end do
      end do

reduced cache misses from 92 to 41 per time step

Speedup from reducing L2 cache misses

flags                      gfortran 4.1.2   ifort 10.0.023
original                   29.95(2) s       28.96(2) s
with scalar replacement    27.43(3) s       28.95(1) s
speedup                    +8.44(1) %       +0.03(2) %

ifort already called with scalar replacement flag

Lookup tables (LUTs)

• Calculations of sqrt() and exp() take up 23.8% of execution time

• Idea: pre-compute values of sqrt() and exp() in an array and recall them from memory when needed

• Caution: LUT should not displace too much data from L2 cache

sqrt() with LUT

[Figure comparing: LUT with linear interpolation; direct LUT]

exp() with LUT

[Figure comparing: LUT with first-order Taylor series refinement*; direct LUT]

* \[ e^x = e^{x_0} + (x - x_0)\, e^{x_0} + O\!\left((x - x_0)^2\right) \]
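The first-order Taylor refinement can be sketched as follows: look up the tabulated value at the nearest grid point x0, then correct it with the (x - x0) e^{x0} term. This Python sketch is illustrative only and is not the talk's Fortran implementation; the class name `ExpTable`, the uniform grid, and the nearest-point rounding are assumptions.

```python
import math

class ExpTable:
    """Uniformly spaced lookup table for exp(x) on [x_min, x_max]."""

    def __init__(self, x_min, x_max, n_entries):
        self.x_min = x_min
        self.step = (x_max - x_min) / (n_entries - 1)
        self.values = [math.exp(x_min + k * self.step) for k in range(n_entries)]

    def _nearest(self, x):
        # Index of the grid point closest to x (x must lie inside the table).
        return round((x - self.x_min) / self.step)

    def direct(self, x):
        """Direct LUT: return the tabulated value at the nearest grid point."""
        return self.values[self._nearest(x)]

    def taylor(self, x):
        """LUT plus first-order Taylor term: e^x0 + (x - x0) e^x0."""
        k = self._nearest(x)
        x0 = self.x_min + k * self.step
        e0 = self.values[k]
        return e0 + (x - x0) * e0

def relative_error(approx, x):
    exact = math.exp(x)
    return abs(approx - exact) / exact
```

With grid spacing h, the direct lookup has a relative error of order h, while the refined value has an error of order h^2, so the refined table reaches a given precision with far fewer entries at the cost of one multiply and one add per call.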

Choice of implementation

function   desired precision   table size (doubles)   refinement   expected speedup
sqrt()     10^-4               10,764                 none         +118%
exp()      10^-8               6,836                  Taylor       +151%

LUT aligned to 128 bits

L2 cache = 4 MB = 512K doubles
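The table sizes can be checked against the L2 cache budget with a few lines of arithmetic. The Python sketch below assumes 8-byte doubles and that both tables are resident at once; the constant and function names are hypothetical.

```python
BYTES_PER_DOUBLE = 8                      # assumption: IEEE 754 double precision
L2_CACHE_BYTES = 4 * 1024 * 1024          # 4 MB, as stated on the slide
SQRT_TABLE_ENTRIES = 10_764
EXP_TABLE_ENTRIES = 6_836

def cache_fraction(entries, cache_bytes=L2_CACHE_BYTES):
    """Fraction of the cache occupied by a table of `entries` doubles."""
    return entries * BYTES_PER_DOUBLE / cache_bytes

l2_doubles = L2_CACHE_BYTES // BYTES_PER_DOUBLE          # 524288, i.e. 512K
total_entries = SQRT_TABLE_ENTRIES + EXP_TABLE_ENTRIES   # 17600
total_fraction = cache_fraction(total_entries)           # about 0.034
```

Under these assumptions the two tables together take 17,600 doubles, roughly 3.4% of the L2 cache, which leaves most of the cache for simulation data.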

Speedup from LUT use

flags                 gfortran 4.1.2   ifort 10.0.023
original              29.95(2) s       28.96(2) s
with lookup tables    26.89(1) s       25.87(2) s
speedup               +10.23(2) %      +7.22(3) %

Summary of serial improvements

Improvement                gfortran 4.1.2             ifort 10.0.023
Best compiler flags        +3.62(3) %                 +20.22(1) %
L2 cache miss reduction    +8.44(2) %                 +0.03(1) %
Lookup tables              +10.23(1) %                +7.22(2) %
Total                      23.91(3) s, +20.17(4) %    26.86(2) s, +26.00(2) %


Parallelization targets

[Figure: the flowchart and time profile from "Algorithm and time profile", repeated; N = 6]

Parallelization strategy

[Figure: the "Calculate potential energy and forces" step, with the charge, dispersion, bond, angle and dihedral interaction boxes annotated with OpenMP directives: omp sections, omp section, omp parallel do. Percentage labels on the figure: 12%, 16%, 11%, 50%, 2%; 50%, 50%, 100%]
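The idea behind the sections directive is that independent energy terms can be evaluated concurrently and summed afterwards. The sketch below mirrors only that structure in Python, using a thread pool in place of OpenMP; it is an analogy, not the talk's Fortran code. The function names are hypothetical, and Python threads will not show the speedup that OpenMP threads give for compute-bound loops.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def charge_term(charges, positions):
    """Electrostatic pair sum over i < j (one independent task)."""
    n = len(charges)
    return sum(charges[i] * charges[j] / math.dist(positions[i], positions[j])
               for i in range(n) for j in range(i + 1, n))

def bond_term(lengths, eq_lengths, k):
    """Harmonic bond stretch sum (another independent task)."""
    return sum(k * (r - r0) ** 2 for r, r0 in zip(lengths, eq_lengths))

def total_energy_parallel(tasks, max_workers=2):
    """Run each zero-argument task in a worker, then add up the results.

    Each task plays the role of one section; the final sum is the
    "add up all components" step.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return sum(f.result() for f in futures)
```

A call such as `total_energy_parallel([lambda: charge_term(q, xyz), lambda: bond_term(r, r0, 2.0)])` returns the same value as evaluating the terms one after another, because the results are collected and summed in submission order.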

Parallelization results

[Plot: execution time/s vs. # cores for N = 6 and N = 1000, with ideal scaling shown for comparison; gfortran 4.1.2]

Summary

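• Free software can sometimes be better than non-free software: after all serial improvements, gfortran 4.1.2 ran in 23.91(3) s versus 26.86(2) s for ifort 10.0.023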

• L2 cache misses can significantly degrade performance

• Lookup tables are an effective way to trade memory and precision for speed

• Simple OpenMP parallelization is effective for small numbers of processors
