Molecular Models, Threads and You
Jiahao Chen
Martínez Group
Dept. Chemistry, CATMS, MRL and Beckman
CS 498 MG presentation: 2007-12-07
Optimizing the TINKER classical molecular dynamics code while maintaining code readability
Molecular models/force fields
Typical energy function:
E = (covalent bond effects) + (noncovalent interactions)
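Molecular models/force fields
Typical energy function:
$$
\begin{aligned}
E ={} & \sum_{b \in \text{bonds}} k_b \left( r_b - r_{\text{eq},b} \right)^2 && \text{(bond stretch)} \\
 & + \sum_{a \in \text{angles}} \kappa_a \left( \theta_a - \theta_{\text{eq},a} \right)^2 && \text{(angle torsion)} \\
 & + \sum_{d \in \text{dihedrals}} \sum_{n} l_{nd} \cos(n\omega) && \text{(dihedrals)} \\
 & + \sum_{i<j \in \text{atoms}} \frac{q_i q_j}{r_{ij}} && \text{(electrostatics)} \\
 & + \sum_{i<j \in \text{atoms}} \epsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right] && \text{(dispersion)}
\end{aligned}
$$
Computational cost = O(N^2)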
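To make the O(N^2) cost concrete, here is a minimal Python sketch of the two pairwise (noncovalent) sums. It is an illustration only, not TINKER's code: the function name `nonbonded_energy` is hypothetical, a single `epsilon` and `sigma` are assumed for all pairs instead of per-pair parameters, and the electrostatic unit prefactor is omitted.

```python
import math

def nonbonded_energy(positions, charges, epsilon, sigma):
    """Sum electrostatic and dispersion terms over all pairs i < j.

    Returns the energy and the number of pairs visited, N(N-1)/2.
    Assumption: one epsilon and one sigma shared by every pair.
    """
    n = len(positions)
    energy = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):            # double loop over pairs: O(N^2)
            r = math.dist(positions[i], positions[j])
            # electrostatics: q_i q_j / r_ij
            energy += charges[i] * charges[j] / r
            # dispersion: eps * [(sigma/r)^12 - (sigma/r)^6]
            sr6 = (sigma / r) ** 6
            energy += epsilon * (sr6 * sr6 - sr6)
            pairs += 1
    return energy, pairs
```

The pair count grows as N(N-1)/2, which is why the charge and dispersion terms dominate for large systems while the bonded terms scale as O(N).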
Problem description
• The state of the system is given by the position $x_i$ and momentum $p_i$ of every atom (of mass $m_i$):
$$(x_1, p_1, x_2, p_2, \ldots, x_N, p_N) \in \mathbb{R}^{3 \times 2 \times N}$$
• Solve the system of coupled differential equations (Hamilton's equations of motion)
$$\frac{\partial x_i}{\partial t} = \frac{p_i}{m_i}, \qquad \frac{\partial p_i}{\partial t} = -\frac{\partial E}{\partial x_i}, \qquad i = 1, \ldots, N$$
• with user-specified initial conditions (e.g. with constant temperature and pressure)
• subject to (user-specified) constraints, e.g. fixed bond angles
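As a rough illustration of how these equations of motion can be stepped forward in time, the sketch below uses the velocity Verlet scheme on a single particle in one dimension. This is one common integrator chosen for illustration; the slides do not say which scheme TINKER uses here. The function name `velocity_verlet` and the harmonic test potential are assumptions.

```python
import math

def velocity_verlet(x, p, m, grad_E, dt, n_steps):
    """Integrate dx/dt = p/m, dp/dt = -dE/dx for one particle in 1D."""
    g = grad_E(x)
    for _ in range(n_steps):
        p -= 0.5 * dt * g      # half-step momentum update
        x += dt * p / m        # full-step position update
        g = grad_E(x)          # gradient of E at the new position
        p -= 0.5 * dt * g      # second half-step momentum update
    return x, p

# Hypothetical test potential: E(x) = 0.5 * k * x**2, so dE/dx = k * x
k, m = 1.0, 1.0
x, p = velocity_verlet(1.0, 0.0, m, lambda pos: k * pos, dt=0.01, n_steps=1000)
energy = p * p / (2.0 * m) + 0.5 * k * x * x   # should stay close to the initial 0.5
exact_x = math.cos(10.0)                       # analytic solution at t = 10
```

With these settings the total energy stays close to its initial value and the position tracks the analytic cosine, which is the behaviour one wants from an integrator before adding thermostats or constraints.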
Many parallel and serial implementations
Package name Threads MPI GlobalArrays
NAMD CHARM++
GROMACS ✓ ✓
TINKER
AMBER partly ✓ ✓
CHARMM ✓
LAMMPS ✓
NWChem ✓ ✓
Things I tried
• Compiler flags optimization
• Cache miss reduction
• Lookup tables
• Parallelization with OpenMP
Compiler flag optimization
flags | gfortran 4.1.2: time | speedup | ifort 10.0.023: time | speedup
-O0 | 29.95(2) s | - | 36.30(2) s | -
-Os | 29.92(3) s | +0.77(3) % | 32.59(4) s | +10.22(2) %
-O1 | 30.22(1) s | -0.90(4) % | 32.12(3) s | +11.51(1) %
-O2 | 29.66(3) s | +0.96(1) % | 30.30(2) s | +16.54(2) %
-O3 | 29.84(2) s | +0.38(2) % | 30.83(2) s | +15.06(2) %
CE search | 28.77(2) s | +3.62(3) % (1) | 28.96(2) s | +20.22(1) % (2)
1. FFLAGS="-falign-functions -falign-jumps -falign-labels -falign-loops -fvpt -fcse-skip-blocks -fdelete-null-pointer-checks -ffast-math -fforce-addr -fgcse -fgcse-lm -fgcse-sm -floop-optimize -fkeep-static-consts -fmerge-constants -fno-defer-pop -fno-guess-branch-probability -fno-math-errno -funsafe-math-optimizations -fno-trapping-math -foptimize-register-move -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop -fno-sched-spec -fsched-spec-load -fsched-stalled-insns -fsignaling-nans -fsingle-precision-constant -fstrength-reduce -fthread-jumps -funroll-all-loops"
2. FFLAGS="-xN -no-prec-div -static -inline-level=1 -ip -fno-alias -fno-fnalias -fno-omit-frame-pointer -fkeep-static-consts -nolib-inline -heap-arrays 1 -pad -O3 -scalar-rep -funroll-loops -complex-limited-range"
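The slide does not state how the percentages are defined. The ifort figures appear consistent with the relative reduction in run time against the -O0 build, which the sketch below computes; this reading is an inference from the numbers, and the helper name `percent_speedup` is hypothetical.

```python
def percent_speedup(t_ref, t_new):
    """Relative reduction in run time, in percent of the reference time."""
    return 100.0 * (t_ref - t_new) / t_ref

# ifort timings in seconds, taken from the table: -O0 vs. the CE search result
ifort_ce = percent_speedup(36.30, 28.96)
# ifort -O0 vs. -Os
ifort_os = percent_speedup(36.30, 32.59)
```

Under this definition a negative value, as for gfortran -O1, means the optimized build ran slower than the reference.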
Algorithm and time profile
[Flowchart with time profile for N = 6, gfortran 4.1.2.
Boxes: Initialize model and parameters; for each time step (>98%): Move one time step; Update state by Δt/2; Calculate potential energy and forces; Update state by Δt/2; Enforce temp. & pressure; Remove unphysical motions; Calculate & record kinetic energy and properties; Flush I/O; End.
"Calculate potential energy and forces" expands into: Calculate charge interactions; Calculate dispersion interactions; Calculate bond interactions; Calculate angle interactions; Calculate dihedral interactions; ...; Add up all components.
Timing labels: >59%, <31%; 37%, 12%, 8%, 9%, 26%. Scaling labels: O(N^2), O(N).]
An unexpected cost
[Flowchart repeated, with the "Add up all components" step singled out.]
Q: Why is 15% of total execution time spent adding numbers!?
A: Many L2 cache misses. The left column gives the L2 cache misses attributed to each source line:

    misses
            c     zero out each of the first derivative components
       7          do i = 1, n
                     do j = 1, 3
      42                deb(j,i) = 0.0d0
                        ...                          (22 other terms)
                     end do
                  end do
                  ...
            c     sum up to get the total energy and first derivatives
                  energy = eb + ...
                  do i = 1, n
                     do j = 1, 3
      19                desum(j,i) = deb(j,i) + ...  (22 other terms)
       2                derivs(j,i) = desum(j,i)
                     end do
                  end do

70 of 91 cache misses per time step (n = 6) shown
A simple solution
Accumulate each sum in a scalar temporary and write that value to both arrays. L2 cache misses per source line, before and after the change:

    before  after
                    c     zero out each of the first derivative components
       7       7          do i = 1, n
                             do j = 1, 3
      42      26                deb(j,i) = 0.0d0
                                ...
                             end do
                          end do
                          ...
                    c     sum up to get the total energy and first derivatives
                          energy = eb + ...
                          do i = 1, n
                             do j = 1, 3
               6                temp = deb(j,i) + ...
      19       1                desum(j,i) = temp
       2       1                derivs(j,i) = temp
                             end do
                          end do

Reduced cache misses from 92 to 41 per time step
Speedup from reducing L2 cache misses
flags | gfortran 4.1.2 | ifort 10.0.023
original | 29.95(2) s | 28.96(2) s
with scalar replacement | 27.43(3) s | 28.95(1) s
speedup | +8.44(1) % | +0.03(2) %
Note: ifort was already called with the scalar replacement flag
Lookup tables (LUTs)
• Calculations of sqrt() and exp() take up 23.8% of execution time
• Idea: pre-compute values of sqrt() and exp() in an array and recall them from memory when needed
• Caution: LUT should not displace too much data from L2 cache
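The lookup-table idea can be sketched in Python as follows, covering the two refinement variants named on this slide: linear interpolation for sqrt() and a first-order Taylor correction for exp(). This is a minimal sketch under stated assumptions, not the code used in the talk: the class names, the uniform grid, and the input ranges [1, 100] and [-10, 0] are hypothetical, so the precision reached here need not match the figures quoted in the slides. Inputs are assumed to lie inside the tabulated range.

```python
import math

class SqrtLUT:
    """sqrt() on [lo, hi] from a uniform table with linear interpolation."""

    def __init__(self, lo, hi, size):
        self.lo = lo
        self.step = (hi - lo) / (size - 1)
        self.table = [math.sqrt(lo + i * self.step) for i in range(size)]

    def __call__(self, x):
        t = (x - self.lo) / self.step
        i = min(int(t), len(self.table) - 2)   # clamp so that i + 1 stays in range
        frac = t - i
        return self.table[i] + frac * (self.table[i + 1] - self.table[i])


class ExpLUT:
    """exp() on [lo, hi] from a uniform table with first-order Taylor refinement."""

    def __init__(self, lo, hi, size):
        self.lo = lo
        self.step = (hi - lo) / (size - 1)
        self.table = [math.exp(lo + i * self.step) for i in range(size)]

    def __call__(self, x):
        # pick the nearest tabulated point x0
        i = min(int((x - self.lo) / self.step + 0.5), len(self.table) - 1)
        x0 = self.lo + i * self.step
        e0 = self.table[i]
        return e0 + (x - x0) * e0              # e^x ~ e^x0 + (x - x0) e^x0


# Table sizes borrowed from the slides; the ranges are illustrative assumptions.
sqrt_lut = SqrtLUT(1.0, 100.0, 10764)
exp_lut = ExpLUT(-10.0, 0.0, 6836)
```

The Taylor variant reuses the tabulated value as its own derivative, so the refinement costs one multiply and one add per call and needs no second table.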
[Plot: sqrt() with LUT. Curves: LUT with linear interpolation; direct LUT.]
[Plot: exp() with LUT. Curves: LUT with first-order Taylor series refinement*; direct LUT.]
* $e^{x} = e^{x_0} + (x - x_0)\, e^{x_0} + O\left( (x - x_0)^2 \right)$
Choice of implementation
function | desired precision | table size (doubles) | refinement | expected speedup
sqrt() | 10^-4 | 10,764 | none | +118%
exp() | 10^-8 | 6,836 | Taylor | +151%
LUT aligned to 128 bits
L2 cache = 4 MB = 512K doubles
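As a quick check on the caution about displacing data from the L2 cache, the arithmetic below compares the two table sizes with the cache capacity quoted on the slide. It assumes 8-byte doubles and that both tables are resident at once; the variable names are illustrative.

```python
BYTES_PER_DOUBLE = 8

l2_bytes = 4 * 1024 * 1024                   # 4 MB L2 cache, from the slide
l2_doubles = l2_bytes // BYTES_PER_DOUBLE    # capacity in doubles (512K)

table_sizes = {"sqrt": 10764, "exp": 6836}   # table sizes in doubles, from the slide
total_doubles = sum(table_sizes.values())
fraction_of_l2 = total_doubles / l2_doubles

print(f"{total_doubles} doubles = {100 * fraction_of_l2:.1f}% of L2")
```

Under these assumptions both tables together occupy only a few percent of the cache, leaving most of it for the simulation data.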
Speedup from LUT use
flags | gfortran 4.1.2 | ifort 10.0.023
original | 29.95(2) s | 28.96(2) s
with lookup tables | 26.89(1) s | 25.87(2) s
speedup | +10.23(2) % | +7.22(3) %
Summary of serial improvements
Improvement | gfortran 4.1.2 | ifort 10.0.023
Best compiler flags | +3.62(3) % | +20.22(1) %
L2 cache miss reduction | +8.44(2) % | +0.03(1) %
Lookup tables | +10.23(1) % | +7.22(2) %
Total | 23.91(3) s, +20.17(4) % | 26.86(2) s, +26.00(2) %
Parallelization targets
[Flowchart repeated from "Algorithm and time profile" (N = 6), with the same timing and scaling labels.]
Parallelization strategy
[Diagram: "Calculate potential energy and forces" and its sub-steps (charge, dispersion, bond, angle and dihedral interactions, ..., Add up all components), annotated with OpenMP constructs: one "omp sections" region containing two "omp section" blocks, and five "omp parallel do" labels. Percentage labels in the diagram: 12%, 16%, 11%, 50%, 2%, 50%, 50%, 100%.]
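The task-level part of this strategy, running independent energy terms side by side and then adding them up, can be mimicked in Python with a thread pool. This is only an analogy to `omp sections`, not OpenMP itself: the two term functions are hypothetical stand-ins, and in standard CPython the global interpreter lock means pure-Python work like this shows the decomposition but gains no real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def charge_energy(n):
    """Hypothetical stand-in for an O(N^2) pairwise term."""
    return sum(1.0 / (j - i) for i in range(n) for j in range(i + 1, n))

def bond_energy(n):
    """Hypothetical stand-in for an O(N) bonded term."""
    return sum(0.5 * (i % 3) ** 2 for i in range(n))

def total_energy(n, workers=2):
    """Evaluate independent energy terms concurrently, then add them up."""
    terms = (charge_energy, bond_energy)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # one task per independent term, in the spirit of one omp section each
        futures = [pool.submit(term, n) for term in terms]
        # the final reduction corresponds to "Add up all components"
        return sum(f.result() for f in futures)
```

Because the terms differ greatly in cost, a split like this is limited by the slowest task, which is one reason to also parallelize the loops inside each term.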
Parallelization results
[Plot: execution time (s) vs. number of cores, gfortran 4.1.2. Series: N = 6, N = 1000, Ideal.]
Summary
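• Free software can sometimes be better than non-free software: after all serial improvements, the gfortran 4.1.2 build ran in 23.91 s versus 26.86 s for ifort 10.0.023
• L2 cache misses can significantly degrade performance
• Lookup tables are an effective way to trade memory and precision for speed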
• Simple OpenMP parallelization is effective for small numbers of processors