Application Performance Analysis on Blue Gene/L
Jim Pool (P.I.), Maciej Brodowicz, Sharon Brunett,
Tom Gottschalk, Dan Meiron, Paul Springer, Thomas Sterling,
Ed Upchurch
Caltech’s Role in Blue Gene/L Project
• Understand implications of BG/L network architecture & drive results from real world ASCI applications
• Develop statistical models of applications, processors as message generators, and the network
• Focus on
  – Application communications distribution
  – Network contention as a function of load, size and adaptive routing
• Represent 64K Nodes Explicitly in Statistical Model
• Create trace analysis tools to characterize applications
– Extensible Trace Facility (ETF)
ETF Built-in Trace Options
• MPI events
  – All point-to-point communications (MPI-1)
  – All collective communications (MPI-1)
  – Non-blocking request tracking
  – Communicator creation and destruction
  – MPI datatype decoding (requires MPI-2)
  – Languages: C, Fortran
  – Easy instrumentation of applications
• Memory reference and program execution tracing
  – Tracking of statically and dynamically allocated arrays (identifiers, element sizes, dimensions)
  – Tracking of scalar variables
  – Read and write accesses to individual scalars and array elements, as well as contiguous vectors of elements
  – Function calls
  – Program execution phases
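The slides do not show ETF's programming interface, but the standard way a trace facility can capture MPI point-to-point events without modifying the application is the PMPI profiling layer. The fragment below is a minimal sketch of that mechanism in C; the etf_log_send() helper is a hypothetical stand-in, not ETF's actual API.

/* Minimal sketch: intercept MPI_Send through the standard PMPI profiling
 * layer and log the destination, message size, and tag.  etf_log_send() is
 * a hypothetical placeholder for whatever the real trace facility records. */
#include <mpi.h>
#include <stdio.h>

static void etf_log_send(int dest, int count, MPI_Datatype type, int tag)
{
    int elem_size;
    MPI_Type_size(type, &elem_size);               /* bytes per element */
    fprintf(stderr, "SEND dest=%d bytes=%d tag=%d\n",
            dest, count * elem_size, tag);
}

/* The profiling layer lets user code override MPI_Send; the vendor
 * implementation remains reachable as PMPI_Send. */
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    etf_log_send(dest, count, type, tag);          /* record the event */
    return PMPI_Send(buf, count, type, dest, tag, comm);
}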
ETF Tracing Example for Magnetohydrodynamic (MHD) Code with Adaptive Mesh Refinement (AMR)
• Parallel MHD fluid code solves the equations of hydrodynamics and the resistive Maxwell's equations
  – Part of a larger application which computes dynamic responses to strong shock waves impinging on target materials
  – Fortran 90 + MPI
  – MPI Cartesian communicators
  – Nearest-neighbor comms use non-blocking send/recv (see the sketch below)
  – MPI_Allreduce for calculating stable time steps
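The communication pattern listed above can be summarized in a short sketch: a Cartesian communicator, a non-blocking nearest-neighbor exchange, and an MPI_Allreduce that selects the globally stable time step. The application itself is Fortran 90; the C version below is purely illustrative, and the buffer layout and 2-D decomposition are assumptions, not the actual AMR MHD code.

/* Illustrative sketch only: Cartesian communicator, non-blocking halo
 * exchange with the two neighbors along one dimension, and a global
 * reduction for the stable time step.  Buffers hold 2*n doubles each
 * (low face followed by high face). */
#include <mpi.h>

void exchange_and_reduce(double *halo_send, double *halo_recv, int n,
                         double local_dt, double *global_dt)
{
    int dims[2] = {0, 0}, periods[2] = {0, 0}, nprocs, lo, hi;
    MPI_Comm cart;
    MPI_Request req[4];

    /* In a real code the Cartesian communicator is created once, not per step. */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    /* Non-blocking send/recv with the lower and upper neighbors in dimension 0. */
    MPI_Cart_shift(cart, 0, 1, &lo, &hi);
    MPI_Irecv(halo_recv,     n, MPI_DOUBLE, lo, 0, cart, &req[0]);
    MPI_Irecv(halo_recv + n, n, MPI_DOUBLE, hi, 1, cart, &req[1]);
    MPI_Isend(halo_send,     n, MPI_DOUBLE, hi, 0, cart, &req[2]);
    MPI_Isend(halo_send + n, n, MPI_DOUBLE, lo, 1, cart, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    /* Every rank must advance with the smallest locally stable time step. */
    MPI_Allreduce(&local_dt, global_dt, 1, MPI_DOUBLE, MPI_MIN, cart);

    MPI_Comm_free(&cart);
}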
AMR MHD: Communication Profile (20 time steps on 32 processors, 128x128 cells)
[Figure: counts of point-to-point and collective MPI events vs. CPU cycles, shown for maximum refinement level 1 and maximum refinement level 2]
Lennard-Jones Molecular Dynamics
• Short-range molecular dynamics application simulating Newtonian interactions in large groups of atoms
  – Production code from Sandia National Lab
• Simulations are large in two dimensions
  – Number of atoms and number of time steps
• Spatial decomposition case selected
  – Each processing node keeps track of the positions and movement of the atoms in a 3-D box
• Computations carried out in a single time step correspond to femtoseconds of real time
  – A meaningful simulation of the evolution of the system's state typically requires thousands of time steps
• Point-to-point MPI messages are exchanged across each of the 6 sides of the box per time step (see the sketch below)
• Code is written in Fortran and MPI
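A minimal sketch of that 6-way face exchange follows. The production code is Fortran; the C fragment below only illustrates the pattern, and the single send/receive buffer per face is an assumption rather than the actual LJS data layout.

/* Illustrative sketch: each rank owns a 3-D box of atoms and exchanges
 * boundary data across its 6 faces once per time step.
 * cart3d is assumed to be a 3-D Cartesian communicator. */
#include <mpi.h>

void exchange_faces(MPI_Comm cart3d, double *sendbuf, double *recvbuf, int n)
{
    /* 3 dimensions x 2 directions = 6 face exchanges per time step. */
    for (int dim = 0; dim < 3; dim++) {
        for (int dir = -1; dir <= 1; dir += 2) {
            int from, to;
            MPI_Cart_shift(cart3d, dim, dir, &from, &to);
            MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, to,   0,
                         recvbuf, n, MPI_DOUBLE, from, 0,
                         cart3d, MPI_STATUS_IGNORE);
        }
    }
}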
[Figure: typical grid cell with cutoff radius r_c, the two communication steps (1) and (2), and the computational cycle model showing Comms / Comp / Wait time per step with synchronization time t_SYNCH]
LJS Single-Processor BG/L Performance: Original Code vs. Tuned for BG/L
[Figure: improvement (%) vs. number of atoms per BG/L CPU (15,625 to 500,000), with the gain attributed to good cache reuse]
LJS Molecular Dynamics Performance: Fixed Problem Size of 1 Billion Atoms
[Figure: time per single iteration (ms) vs. number of BG/L CPUs (2k to 64k), broken down into compute time and communications time]
LJS Speedup: BG/L vs. ASCI Red (3200 Nodes), 1 Billion Atom Problem
[Figure: speedup vs. number of BlueGene/L nodes (2k to 64k)]
LJS Communications Time: 500,000 Atoms per BG/L Node
[Figure: communications time per iteration (msecs) for 4x4x4 (64 nodes), 8x8x8 (512 nodes), and 16x16x16 (4096 nodes) BG/L configurations, comparing physical nearest-neighbor mapping with random mapping]
What is QMC and Why is it a Good Fit for BG/L?
• QMC is a finite all-electron Quantum Monte Carlo code used to determine quantum properties of materials with extremely high accuracy
• Developed at Caltech by Bill Goddard’s ASCI Material Properties group
• Interesting Characteristics
  – Low memory requirements
  – After initialization, highly parallel and scalable
  – Minimal set of MPI calls required
    • Non-blocking p2p, reduction, probe, communicator, collective calls
  – No communications during QMC working steps
  – Communicating convergence statistics is 7200 bytes regardless of problem size and node count
  – Code already ported to many platforms (Linux, AIX, IRIX, etc.)
    • C++ and MPI sources
Iterative QMC Algorithm
For each processor do:
    steps = total steps / number of processors
    generate walkers
    equilibrate walkers
    for each step:
        generate QMC statistics
    send QMC statistics to master node
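A minimal sketch of this loop in MPI terms is shown below. It assumes the per-rank statistics fit in a fixed-size buffer (the slides quote 7200 bytes, written here as 900 doubles, which is an assumption about the layout) and that the final communication is a single reduction to the master node; the per-step statistics accumulation is left as a placeholder since the actual QMC kernels are not shown.

/* Illustrative sketch of the iterative QMC loop: work is divided across
 * processors, steps run without communication, and one fixed-size buffer of
 * convergence statistics is reduced to the master node at the end.
 * STATS_DOUBLES is an assumed layout: 900 doubles * 8 bytes = 7200 bytes. */
#include <mpi.h>

#define STATS_DOUBLES 900

void qmc_run(long total_steps)
{
    int nprocs;
    double stats[STATS_DOUBLES] = {0.0};
    double global[STATS_DOUBLES];

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    long steps = total_steps / nprocs;       /* divide work across processors */

    /* generate and equilibrate walkers (application-specific, omitted) */

    for (long s = 0; s < steps; s++) {
        /* QMC working step: accumulate statistics locally, no communication */
        stats[0] += 1.0;                     /* placeholder accumulation */
    }

    /* Single fixed-size reduction of convergence statistics to the master node. */
    MPI_Reduce(stats, global, STATS_DOUBLES, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
}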
QMC Communications Time for 100,000 Steps per Node (Reduce Using the Torus)
[Figure: time (seconds, log scale) vs. BG/L configuration: 8x8x8 (512), 16x16x16 (4K), 32x16x16 (8K), 32x32x16 (16K), 32x32x32 (32K), 64x32x32 (64K)]
Future Application Porting and Analysis for BG/L
• ASCI solid dynamics code simulating the mechanical response of polycrystalline materials, such as tantalum
• Address memory constraints, grain load imbalance, and MPI_Waitall() efficiency as we port/tune to BG/L
  – Good stress test for BG/L robustness
• Scalable simulation of polycrystalline response with an assumed grain shape: the space-filling polyhedron corresponding to the Wigner-Seitz cell of a BCC crystal. The 390-grain example shown here was run on LLNL's IBM SP3, frost.