Implementation of Scientific Computing Applications on the Cell Broadband Engine processor

Guochun Shi, Volodymyr Kindratenko
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Three scientific applications
• NAnoscale Molecular Dynamics (NAMD)
  • James Phillips, Theoretical and Computational Biophysics Group, UIUC
• MIMD Lattice Computation (MILC)
  • Steven Gottlieb, Physics Department, Indiana University
• Electron Repulsion Integral (ERI) in quantum chemistry
  • Ivan S. Ufimtsev, Todd J. Martinez, Chemistry Department, UIUC
2
Presentation outline
• Introduction
  • Cell Broadband Engine
  • Target applications
    1. NAnoscale Molecular Dynamics (NAMD)
    2. MIMD Lattice Computation (MILC) collaboration
    3. Electron Repulsion Integral (ERI) in quantum chemistry
• In-house task library and task dispatch system
• Implementation and performance
  • Application porting and optimization
• Summary
• Conclusions
3
Cell Broadband Engine
• One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs); each SPE has 256 KB of local store
• 3.2 GHz processor
• 25.6 GB/s processor-to-memory bandwidth
• > 200 GB/s EIB sustained aggregate bandwidth
• Theoretical peak performance of 204.8 GFLOPS (SP) and 14.63 GFLOPS (DP)

[Block diagram: the PowerPC Processor Element (PPE), with its PowerPC Processor Unit (PPU), L1 instruction and data caches, L2 cache, and PowerPC Processor Storage Subsystem (PPSS), and Synergistic Processor Elements (SPE) 0-7, each with a Synergistic Processor Unit (SPU), Local Store (LS), and Memory Flow Controller (MFC) containing a DMA controller, all connected by the Element Interconnect Bus (EIB) to the Memory Interface Controller (MIC) and the I/O interfaces IOIF0 and IOIF1]
4
In-house task library and dispatch system

Compute task struct:

    typedef struct task_s {
        int cmd;     // operand
        int size;    // the size of the task structure
    } task_t;

    typedef struct compute_task_s {
        task_t common;
        <user_type1> <user_var_name1>;
        <user_type2> <user_var_name2>;
        ...
    } compute_task_t;

API for PPE and SPE:

    int  ppu_task_init(int argc, char **argv, spe_program_handle_t);  // initialization
    int  ppu_task_run(volatile task_t *task);                // start a task in all SPEs
    int  ppu_task_spu_run(volatile task_t *task, int spe);   // start a task in one SPE
    int  ppu_task_spu_wait(void);       // wait for any SPE to finish, blocking call
    void ppu_task_spu_waitall(void);    // wait for all SPEs to finish, blocking call

    int  spu_task_init(unsigned long long);
    int  spu_task_register(dotask_t, int);   // register a task
    int  spu_task_run(void);                 // start the infinite loop, wait for tasks
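On the SPE side, a user kernel would plug into this API roughly as follows. This is a minimal sketch: the handler, its signature, and the command code are assumptions, not the library's; only the spu_task_* calls are from the API above.

    #define TASK_PAIR_COMPUTE 1     /* hypothetical command code */

    /* Hypothetical task handler, matching an assumed dotask_t shape. */
    static int do_pair_compute(volatile task_t *task)
    {
        compute_task_t *t = (compute_task_t *)task;
        /* DMA in the atom data described by t, compute forces, DMA results back */
        (void)t;
        return 0;
    }

    int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
    {
        (void)speid; (void)envp;
        spu_task_init(argp);                                  /* library setup */
        spu_task_register((dotask_t)do_pair_compute, TASK_PAIR_COMPUTE);
        return spu_task_run();   /* loop forever, executing tasks sent by the PPE */
    }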
Task dispatch system in NAMD
• The PPE keeps a pool of patch pairs to be processed and a pool of idle SPEs
• Scheduling loop: find a dependency-free patch pair, find an idle SPE (via ppu_task_spu_wait()), remove the patch pair from the pool, and run a pair-compute task on that SPE
• When the pool is empty, wait until all SPU tasks exit, then stop

5
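A sketch of the PPE-side scheduling loop implied by this description, using the documented ppu_task_* calls. The pool helpers are hypothetical, and ppu_task_spu_wait() is assumed to return the index of the SPE that became idle; neither detail is confirmed by the source.

    typedef struct patch_pair patch_pair_t;

    /* Hypothetical pool helpers; only the ppu_task_* calls are the library's. */
    extern int           pool_empty(void);
    extern patch_pair_t *find_dependency_free_pair(void);
    extern void          pool_remove(patch_pair_t *pp);
    extern void          fill_task(compute_task_t *t, patch_pair_t *pp);

    compute_task_t task;

    while (!pool_empty()) {
        patch_pair_t *pp = find_dependency_free_pair();
        if (pp == NULL) {             /* nothing runnable yet: let an SPE      */
            ppu_task_spu_wait();      /* finish, which may free a dependency   */
            continue;
        }
        int spe = ppu_task_spu_wait();   /* find an idle SPE (assumed return)  */
        pool_remove(pp);                 /* remove patch pair from the pool    */
        fill_task(&task, pp);            /* build the compute_task_t           */
        ppu_task_spu_run((volatile task_t *)&task, spe);  /* run on that SPE   */
    }
    ppu_task_spu_waitall();              /* wait until all SPU tasks exit      */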
NAMD
• Compute forces and update positions repeatedly
• The simulation space is divided into rectangular regions called patches
  • Patch dimensions > cutoff radius for non-bonded interactions
• Each patch only needs to be checked against nearby patches
  • Self-compute and pair-compute
6
NAMD kernel
NAMD SPEC 2006 CPU benchmark kernel:
  1: for each atom i in patch pk
  2:   for each atom j in patch pl
  3:     if atoms i and j are bonded, compute bonded forces
  4:     otherwise, if atoms i and j are within the cutoff distance, add atom j to i's atom pair list
  5:   end
  6:   for each atom k in i's atom pair list
  7:     compute non-bonded forces (L-J potential and PME direct sum, both via lookup tables)
  8:   end
  9: end

We implemented a simplified version of the kernel that excludes pair lists and bonded forces:
  1: for each atom i in patch pk
  2:   for each atom j in patch pl
  3:     if atoms i and j are within the cutoff distance
  4:       compute non-bonded forces (L-J potential and PME direct sum, both via lookup tables)
  5:     end
  6:   end

7
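In C, the simplified kernel amounts to a doubly nested loop of this shape. This is a scalar sketch; atom_t, the lookup-table helpers, and cutoff2 are illustrative names, not NAMD's.

    extern float lj_table_lookup(float r2);    /* hypothetical L-J table helper        */
    extern float pme_table_lookup(float r2);   /* hypothetical PME direct-sum helper   */

    typedef struct { float x, y, z, q; } atom_t;   /* illustrative atom record */

    void simplified_kernel(const atom_t *pk, int nk,
                           const atom_t *pl, int nl,
                           float cutoff2, float fx[], float fy[], float fz[])
    {
        for (int i = 0; i < nk; i++) {            /* 1: for each atom i in patch pk */
            for (int j = 0; j < nl; j++) {        /* 2: for each atom j in patch pl */
                float dx = pk[i].x - pl[j].x;
                float dy = pk[i].y - pl[j].y;
                float dz = pk[i].z - pl[j].z;
                float r2 = dx*dx + dy*dy + dz*dz;
                if (r2 < cutoff2) {               /* 3: within the cutoff distance  */
                    /* 4: non-bonded forces, both terms via lookup tables */
                    float f = lj_table_lookup(r2) + pme_table_lookup(r2);
                    fx[i] += f * dx;
                    fy[i] += f * dy;
                    fz[i] += f * dz;
                }
            }
        }
    }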
NAMD implementation: SPU data movement for distance computation (SP case)
• SIMD: each component is kept in a separate vector
• Data movement dominates the time
• Buffer sizes are carefully chosen to fit into the local store

[Figure: data in memory is stored per atom as (x0 y0 z0 c0 id0)(x1 y1 z1 c1 id1)...; shuffle operations regroup four atoms into the component vectors (x0 x1 x2 x3), (y0 y1 y2 y3), (z0 z1 z2 z3), (c0 c1 c2 c3); the computation then forms the difference vectors (dx0..dx3), (dy0..dy3), (dz0..dz3) and finally the distance vector (r0 r1 r2 r3)]

Local store usage (KB):

         Code   L-J table   Table_four   Atom buffer   Force buffer   Stack/others
  SP      25       55           45            30             18             83
  DP      25       55           91            48             18             19

8
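In plain C, the shuffle step is an array-of-structures to structure-of-arrays transpose over groups of four atoms. On the SPU the same regrouping is done with shuffle intrinsics on quadword vectors; the scalar form below is illustrative, with names of our own choosing.

    /* AoS layout as stored in main memory: x, y, z, charge, id per atom */
    typedef struct { float x, y, z, c; int id; } atom_rec_t;

    /* Regroup four atoms into the component vectors the SIMD code consumes. */
    void aos_to_soa4(const atom_rec_t a[4],
                     float x[4], float y[4], float z[4], float c[4])
    {
        for (int k = 0; k < 4; k++) {   /* one shuffle per component on the SPU */
            x[k] = a[k].x;
            y[k] = a[k].y;
            z[k] = a[k].z;
            c[k] = a[k].c;
        }
    }
    /* The distance code then computes dx, dy, dz for four pairs at once and
       r[k] = sqrtf(dx[k]*dx[k] + dy[k]*dy[k] + dz[k]*dz[k]). */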
Implementation: optimizations
• Different vectorization schemes are applied in order to get the best performance
• Self-compute: do redundant computations and fill zeros into the unused slots
• Pair-compute: save enough pairs of atoms (those within the cutoff distance), then do the calculations four at a time, nullifying forces for pairs outside the cutoff; a sketch follows the flowchart below

[Flowchart: task start, then DMA in atoms and forces for the patch. Self-compute path: compute distances for 4 pairs of atoms; if any of them is within the cutoff, compute 4 non-bonded forces and nullify forces for those outside the cutoff distance. Pair-compute path: save pairs within the cutoff distance; whenever the number of saved pairs reaches 4, compute 4 non-bonded forces. Both paths loop until all pairs are processed, then DMA forces back to main memory and the task exits]

9
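A sketch of the pair-compute staging described above: in-cutoff pairs are collected in a small buffer until four are available, then one 4-wide SIMD force evaluation is issued. The helper names (dist2, compute_4_nonbonded_forces and its padded variant) are illustrative, not NAMD's.

    extern float dist2(int i, int j);        /* hypothetical squared-distance helper */
    extern void  compute_4_nonbonded_forces(const int *i4, const int *j4);
    extern void  compute_4_nonbonded_forces_padded(const int *i4, const int *j4, int n);

    void pair_compute(int nk, int nl, float cutoff2)
    {
        int saved_i[4], saved_j[4];          /* staging buffer for in-cutoff pairs */
        int nsaved = 0;

        for (int i = 0; i < nk; i++) {                  /* atoms in patch pk */
            for (int j = 0; j < nl; j++) {              /* atoms in patch pl */
                if (dist2(i, j) < cutoff2) {
                    saved_i[nsaved] = i;
                    saved_j[nsaved] = j;
                    if (++nsaved == 4) {     /* buffer full: one 4-wide SIMD step */
                        compute_4_nonbonded_forces(saved_i, saved_j);
                        nsaved = 0;
                    }
                }
            }
        }
        if (nsaved > 0)     /* drain the remainder, zero-filling the unused slots */
            compute_4_nonbonded_forces_padded(saved_i, saved_j, nsaved);
    }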
NAMD performance: static analysis

[Charts: number of clock cycles for scalar vs. SIMD force and distance computation (left), and compute vs. data-manipulation cycles (right), for single-precision self-compute, single-precision pair-compute, double-precision self-compute, and double-precision pair-compute; vertical axes up to 4E+10 cycles]

• Distance computation code takes most of the time
• Data manipulation time is significant
10
NAMD performance

[Chart: scaling and speedup of the force-field kernel compared to a 3.0 GHz Intel Xeon processor. Relative to the Xeon (1.0), the single-precision kernel reaches speedups of 1.7, 3.4, 6.8, 13.4, and 25.8 on 1, 2, 4, 8, and 16 SPEs; the double-precision kernel reaches 1.5, 2.9, 5.8, 11.6, and 22.5]

[Chart: NAMD kernel execution time (seconds) on different architectures, for the single- and double-precision kernels: Cell/B.E. PPE (3.2 GHz), Intel Itanium 2 (900 MHz), IBM Power4 (1.2 GHz), Intel Xeon (3.0 GHz), AMD Opteron (2.2 GHz), and Cell/B.E. with 8 SPEs (3.2 GHz)]

• 13.4x speedup for SP and 11.6x speedup for DP compared to a 3.0 GHz Intel Xeon processor
• SP performance is < 2x better than DP

11
MILC Introduction
• MILC performs large-scale numerical simulations to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics
• It is a large cycle consumer on supercomputers
• Our target application: dynamical clover fermions (clover_dynamical) using the hybrid molecular dynamics R algorithm
• Our view of the MILC application: a sequence of communication and computation blocks

[Diagram: original CPU-based implementation: compute loop 1, MPI scatter/gather for loop 2, compute loop 2, MPI scatter/gather for loop 3, ..., compute loop n, MPI scatter/gather for loop n+1]

12
MILC performance on the PPE
• Step 1: try to run it on the PPE
• On the PPE it runs approximately 2-3x slower than a modern CPU
• MILC is bandwidth-bound; this agrees with what we see with the STREAM benchmark

[Chart: runtime (seconds) on CPU and PPE for the 8x8x16x16 and 16x16x16x16 lattices]
[Chart: STREAM benchmark bandwidth (Copy, Scale, Add, Triad) on CPU and PPE]

13
MILC: Execution profile and kernels to be ported

[Chart: per-kernel run time (%) and cumulative run time (%) on the 8x8x16x16 and 16x16x16x16 lattices, for kernels including udadu_mu_nu, dslash_w_site_special, su3mat_copy, mult_su3_nn, mult_su3_na, mult_this_ldu_site, udadu_mat_mu_nu, single action, su3_adjoint, general strided gather, f_mu_nu, scalar_mult_add_wvec, scalar_multi_su3_matrix_addi, d_congrad2_cl, set_neighbor, mult_su3_an, update_u, compute_clov, update_h_cl, scalar_mult_add_wvec_magsq, realtrace_su3_nn, add_su3_matrix, wp_shrink, magsq_wvec_task, gauge_action, set su3_matrix to zero, and reunitarize]

• 10 of these subroutines are responsible for >90% of the overall runtime
• su3mat_copy (i.e., memcpy) is responsible for nearly 20% of all runtime
• All kernels together are responsible for 98.8%
14
MILC: Kernel memory access pattern
• Neighbor data access is taken care of by the MPI framework
• In each iteration, only small elements are accessed
  • Lattice site: 1832 bytes
  • su3_matrix: 72 bytes
  • wilson_vector: 96 bytes
• Challenge: how to get data into the SPUs as fast as possible?
  • Data is nonaligned
  • Data sizes are not multiples of 128 bytes

    #define FORSOMEPARITY(i,s,choice) \
        for( i=((choice)==ODD ? even_sites_on_node : 0 ), s= &(lattice[i]); \
             i< ( (choice)==EVEN ? even_sites_on_node : sites_on_node); \
             i++,s++)

One sample kernel from the udadu_mu_nu() routine:

    FORSOMEPARITY(i,s,parity) {
        mult_adj_mat_wilson_vec( &(s->link[nu]), ((wilson_vector *)F_PT(s,rsrc)), &rtemp );
        mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]), &rtemp, &(s->tmp) );
        mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]), &(s->tmp), &rtemp );
        mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
        mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
        su3_projector_w( &rtemp, ((wilson_vector *)F_PT(s,lsrc)), ((su3_matrix *)F_PT(s,mat)) );
    }

[Diagram: per-site data accesses on lattice site 0, including data from neighbor sites]

15
Approach I: packing and unpacking

[Diagram: struct site data in PPE/main memory is packed into contiguous buffers, transferred to the SPEs with DMA operations, and unpacked again on the way back]

• Good performance in the DMA operations
• Packing and unpacking are expensive on the PPE

16
Approach II: Indirect memory access

[Diagram: elements of the original lattice's struct site are replaced with pointers into contiguous memory regions, which the SPEs fetch with DMA operations]

• Replace elements in struct site with pointers
• Pointers point to contiguous memory regions
• PPU overhead due to the indirect memory access

17
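A sketch of the indirection, assuming single-precision MILC types; the member set is illustrative (the real struct site has many more fields, 1832 bytes per site in total).

    typedef struct { float real, imag; } complex;
    typedef struct { complex e[3][3]; } su3_matrix;      /*  72 bytes */
    typedef struct { complex d[4][3]; } wilson_vector;   /*  96 bytes */

    /* Before: payload stored inline, so per-field data is strided by
       sizeof(site) across the lattice. */
    typedef struct {
        su3_matrix    link[4];
        wilson_vector tmp;
        /* ... more fields ... */
    } site_inline;

    /* After: struct site holds pointers into contiguous per-field arrays,
       so one large DMA can stream a whole field to the SPEs. */
    typedef struct {
        su3_matrix    *link;   /* points into link_storage[4*i] */
        wilson_vector *tmp;    /* points into tmp_storage[i]    */
    } site_indirect;

    su3_matrix    *link_storage;   /* 4 * sites_on_node matrices, contiguous */
    wilson_vector *tmp_storage;    /* sites_on_node vectors, contiguous      */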
Approach III: Padding and small memory DMAs

[Diagram: struct site elements in the original lattice are padded to DMA-friendly sizes; the SPEs then fetch them directly with small DMA operations]

• Pad struct site elements to an appropriate size
• Gained good bandwidth performance, at the cost of padding overhead
• su3_matrix grows from a 3x3 to a 4x4 complex matrix
  • 72 bytes → 128 bytes
  • Bandwidth efficiency lost: 44%
• wilson_vector grows from 4x3 complex to 4x4 complex
  • 96 bytes → 128 bytes
  • Bandwidth efficiency lost: 23%

18
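A sketch of the padded types, assuming single-precision complex numbers (8 bytes each); padding makes each element exactly 128 bytes, so one aligned DMA fetches a whole element. The type and function names are ours, not MILC's.

    #include <spu_mfcio.h>

    typedef struct { float real, imag; } complex;              /* 8 bytes in SP */

    typedef struct { complex e[3][3]; } su3_matrix;            /*  72 bytes */
    typedef struct { complex e[4][4]; } su3_matrix_pad;        /* 128 bytes */

    typedef struct { complex d[4][3]; } wilson_vector;         /*  96 bytes */
    typedef struct { complex d[4][4]; } wilson_vector_pad;     /* 128 bytes */

    /* On the SPU, one padded element is then a single DMA; transfers whose
       size and alignment are multiples of 128 bytes perform best on the EIB. */
    void fetch_matrix(su3_matrix_pad *ls_buf,        /* local-store target */
                      unsigned long long ea,         /* effective address  */
                      unsigned int tag)
    {
        mfc_get(ls_buf, ea, sizeof(su3_matrix_pad), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();                   /* wait for completion */
    }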
MILC struct site padding
• Accesses with a 128-byte stride have different performance depending on the stride
• This is due to the 16 banks in main memory
• Odd-numbered strides always reach the peak
• We chose to pad struct site to 2688 (21 x 128) bytes

Measured bandwidth vs. stride (in units of 128 bytes):

  stride:   1      2      3      4      5      6      7      8      9
  GB/s:   25.38  12.69  25.36   8.60  25.49  12.89  25.48   4.26  25.50

  stride:  10     11     12     13     14     15     16     17
  GB/s:   12.93  25.34   8.58  25.51  12.89  25.47   2.13  25.34
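The 2688-byte choice follows from this table: round the padded struct site up to a multiple of 128 bytes, then, if that multiple is even, add one more 128-byte unit so that consecutive sites land in different banks. A sketch of that rule, under our reading of the slide:

    #include <stddef.h>

    /* Pad struct site so its size is an ODD multiple of 128 bytes. */
    size_t pad_site_size(size_t raw_size)
    {
        size_t s = (raw_size + 127) & ~(size_t)127;  /* round up to a 128-byte multiple */
        if ((s / 128) % 2 == 0)                      /* even multiples hit bank conflicts */
            s += 128;                                /* bump to the next odd multiple     */
        return s;                                    /* MILC's padded site: 2688 = 21*128 */
    }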
19
MILC Kernel performance
• GFLOPS are low for all kernels
• Bandwidth is around 80% of peak for most kernels
• Kernel speedups compared to the CPU are between 10x and 20x for most kernels
• The set_memory_to_zero kernel has ~40x speedup
• memcpy speedup is >15x

[Charts: per-kernel peak GFLOPS (%), peak bandwidth (%), and speedup over the CPU on the 8x8x16x16 and 16x16x16x16 lattices, over the same kernel set as the execution profile]

20
MILC Application performance
• Single-Cell application performance speedup: ~8-10x compared to a single Xeon core
• Profile on the Xeon: 98.8% parallel code, 1.2% serial code
• On Cell, kernel (SPU) time is 67% down to 38% of the overall runtime, with the PPU accounting for the remaining 33% to 62%
• The PPE is standing in the way of further improvement

[Charts: execution time (seconds), with the SPE contribution marked, for the 8x8x16x16 and 16x16x16x16 lattices across execution modes: 1-core Xeon, 8-core Xeon, 1 PPE + 8 SPEs, 1 PPE + 16 SPEs (NUMA), and 2 PPEs with 8 SPEs per PPE (MPI)]

21
Introduction – Quantum Chemistry
• Two basic questions in chemistry*:
  1. Where are the electrons?
  2. Where are the nuclei?
• Quantum chemistry focuses on the first question by solving the time-independent Schrödinger equation to get the electronic wave functions; the absolute square is interpreted as the probability distribution for the positions of the electrons
• The probability distribution function is usually expanded in Gaussian-type basis functions
• To find the coefficients of this expansion, we need to do lots of two-electron repulsion integrals (ERIs)

* http://mtzweb.scs.uiuc.edu/research/gpu/

22
Introduction – Electron Repulsion Integral (ERI)

Reference CPU implementation:

    for (s1 = startShell; s1 < stopShell; s1++)
     for (s2 = s1; s2 < totNumShells; s2++)
      for (s3 = s1; s3 < totNumShells; s3++)
       for (s4 = s3; s4 < totNumShells; s4++)
        for (p1 = 0; p1 < numPrimitives[s1]; p1++)
         for (p2 = 0; p2 < numPrimitives[s2]; p2++)
          for (p3 = 0; p3 < numPrimitives[s3]; p3++)
           for (p4 = 0; p4 < numPrimitives[s4]; p4++)
           {
             .....
             H_ReductionSum[s1,s2,s3,s4] += sqrt(F_PI*rho)*
                 I1*I2*I3*Weight*Coeff1*Coeff2*Coeff3*Coeff4;
           }

[Equation: the general form of the [ss|ss] primitive integral]

• The four outer loops sequence through all unique combinations of electron shells
• The four inner loops sequence through all shell primitives
• The primitive [ss|ss] integrals are computed and summed up in the core code

23
Porting to Cell B.E. – load balancing the SPEs
• Function-offload programming model
  [Diagram: the PPE distributes the work, each SPE computes its share, and the PPE collects the results from the SPEs and continues]
• Each SPE is assigned a range of (s1,s2,s3,s4) to work on
• Load balance is computed on the PPE:
  • Each primitive integral has roughly the same amount of computation
  • Each contracted integral may have a different amount of computation
  • Each iteration of the outermost loop has a different amount of computation
  • The PPE must compute the amount of computation before the SPEs run

(The reference loop nest from the previous slide applies here unchanged.)

24
SPE kernels
• Is the local store big enough?
  • Input data: an array of coordinates + an array of shells + an array of primitives, < 32 KB
  • The Gauss error function erf() is evaluated by interpolating a table; the table size is < 85 KB
  • We have enough local store
• Precomputing is necessary to reduce redundant computation
  • Precomputed intermediate results are much larger than the local store

25
SPE kernel – precompute
• Instead of computing every primitive integral from equations, we precompute pairwise quantities and store them in main memory
• All pairwise quantities are DMAed in before computing a contracted integral
• Precomputed quantities, for a pair of primitives with exponents α1, α2, centers r1, r2, and contraction coefficients d1, d2:
  • α1 + α2
  • (α1·r1 + α2·r2) / (α1 + α2)
  • d1·d2·(π/(α1 + α2))^(3/2)·exp(−α1·α2·R12² / (α1 + α2))
• Trade bandwidth for computation

[The reference loop nest, with a "DMA in precomputed pair quantities" step inserted before the four inner primitive loops]

26
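A sketch of the pairwise precompute, following the quantities above; the struct layout and function name are ours, and F_PI echoes the constant used in the reference code.

    #include <math.h>

    #define F_PI 3.14159265358979f

    typedef struct {
        float alpha_sum;     /* a1 + a2                                             */
        float Px, Py, Pz;    /* Gaussian product center (a1*R1 + a2*R2)/(a1 + a2)   */
        float K;             /* d1*d2*(pi/(a1+a2))^(3/2)*exp(-a1*a2*R12^2/(a1+a2))  */
    } pair_qty_t;

    pair_qty_t precompute_pair(float a1, const float R1[3], float d1,
                               float a2, const float R2[3], float d2)
    {
        pair_qty_t p;
        float s  = a1 + a2;
        float dx = R1[0] - R2[0], dy = R1[1] - R2[1], dz = R1[2] - R2[2];
        float r2 = dx*dx + dy*dy + dz*dz;      /* squared center distance R12^2 */
        p.alpha_sum = s;
        p.Px = (a1*R1[0] + a2*R2[0]) / s;
        p.Py = (a1*R1[1] + a2*R2[1]) / s;
        p.Pz = (a1*R1[2] + a2*R2[2]) / s;
        p.K  = d1 * d2 * powf(F_PI / s, 1.5f) * expf(-a1 * a2 * r2 / s);
        return p;
    }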
SPE kernel – inner loop optimizations
• Naïve way of SIMDizing the kernel:
  • Use a counter to keep track of the number of primitive integrals
  • If the counter reaches 4, do a 4-way computation

Naïve implementation of the SIMD kernel:

    for (p1=0; p1<numPrimitives[s1]; p1++)
     for (p2=0; p2<numPrimitives[s2]; p2++)
      for (p3=0; p3<numPrimitives[s3]; p3++)
       for (p4=0; p4<numPrimitives[s4]; p4++)
       {
         static int count = 0;
         count++;
         if (count == 4) {
           compute 4 primitive integrals using SIMD intrinsics
           count = 0;
         }
       }

• Loop switch:
  • If one loop's length is a multiple of 4, we switch that loop to the innermost position
  • Advance in increments of 4 and get rid of the counter

Loop switch (loop 1's length is a multiple of 4, so loops 1 and 4 are switched):

    for (p4=0; p4<numPrimitives[s4]; p4++)
     for (p2=0; p2<numPrimitives[s2]; p2++)
      for (p3=0; p3<numPrimitives[s3]; p3++)
       for (p1=0; p1<numPrimitives[s1]; p1+=4)
       {
         compute 4 primitive integrals using SIMD intrinsics
       }

• Loop unrolling:
  • If numPrimitives is the same for all primitive integrals and happens to be a nice number, completely unrolling the inner loops generates the most efficient code

27
Performance results

                                       Model1          Model2
  # of atoms                               30              64
  Basis set                            6-311G          STO-6G
  # of integrals                  528,569,315   2,861,464,320
  # of reduction elements           3,146,010       2,207,920
  Run time, Xeon (2.33 GHz), s          21.04          112.55
  Run time, Cell blade (16 SPEs), s       1.1            0.92
  Speedup                                 19x            122x

• Model1 is 10 water molecules (30 atoms in total); Model2 is 64 hydrogen atoms arranged in a lattice
• Model1 has non-uniform contracted-integral intensity, ranging from 4096 down to just 1 primitive integral
  • Loop switching is not always possible, and the naïve SIMD implementation has more control overhead
  • Loop unrolling is not possible, since the number of iterations in each inner loop changes
  • Precomputing proves to slow things down, due to DMA overhead
• Model2 has uniform computation intensity
  • Precomputing and loop unrolling prove to be efficient
  • Our module outperforms the GPU implementation*

28
* Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation. Ivan S. Ufimtsev and Todd J. Martinez, Journal of Chemical Theory and Computation, February 2008
Summary

  Ported code           Precision  Limiting factor  Cell blade vs. single-core speedup (1)  Cell blade vs. Xeon blade speedup (2)
  NAMD modified kernel  SP, DP     Computation      22-26x                                  n/a
  MILC application      SP         Bandwidth        10-12x                                  1.5-4.1x
  ERI kernel            SP         Computation      19-122x                                 n/a

1. The comparison CPU core for NAMD is a 3.0 GHz Xeon; for MILC and ERI it is a 2.33 GHz Xeon.
2. One Xeon blade contains 8 cores with 2 MB of L2 cache per core.

29
Lessons learned
• Keep code on the PPE to a minimum
  • In MILC, the 1.2% of code (by CPU runtime) left on the PPE accounts for 33-62% of the runtime
• Find the limiting factor in the kernel and work on it
  • MILC is bandwidth-limited, so we focused on improving the usable bandwidth
  • NAMD and ERI are compute-bound, so we focused on optimizing the SIMD kernels
• Sometimes we can trade bandwidth for computation, or vice versa
  • In ERI, depending on the input data, we can either precompute some quantities on the PPE and DMA them in, or do all the computation on the SPEs
• Application data layout in memory is critical
  • Padding would not be necessary if MILC were field-centered: an improvement in both performance and productivity
  • A proper data layout makes SIMDizing kernel code easier and makes it faster

30
Thank You
• Questions?
31