ORNL is managed by UT-Battelle for the US Department of Energy
Balancing Particle and Mesh Computation in a Particle-In-Cell Code
Patrick H. Worley Oak Ridge National Laboratory
May 12, 2016 CUG 2016 Grange Tower Bridge Hotel London, UK
These slides have been authored by a contractor of the U.S. Government under contract No. DE-AC52-07NA27344. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.
ORNL is managed by UT-Battelle for the US Department of Energy
Alternate Title: Issues in Maintaining Scalability as Application and (Cray) Architectures Evolve Patrick H. Worley Oak Ridge National Laboratory
May 12, 2016 CUG 2016 Grange Tower Bridge Hotel London, UK
These slides have been authored by a contractor of the U.S. Government under contract No. DE-AC52-07NA27344. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.
3 Presentation_name
Co-Authors • Eduardo F. D'Azevedo (Oak Ridge National Laboratory)
• Robert Hager (Princeton Plasma Physics Laboratory)
• Seung-Hoe Ku (Princeton Plasma Physics Laboratory)
• Eisung Yoon (Rensselaer Polytechnic Institute)
• Choong-Seock Chang (Princeton Plasma Physics National Laboratory)
4 Presentation_name
Acknowledgements • Support for this work was provided through the Scientific Discovery through
Advanced Computing (SciDAC) program funded by the U.S. Department of Energy (DOE) Office of Advanced Scientific Computing Research and the Office of Fusion Energy Sciences, as part of the projects “Center for Edge Physics Simulation (EPSi)” and “Institute for Sustained Performance, Energy, and Resilience (SUPER)”. The work was performed at Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. De-AC05- 00OR22725, at Princeton Plasma Physics Laboratory, which is managed by Princeton University under Contract No. DE- AC02-09CH11466, and at Rensselaer Polytechnic Institute under Contract No. DE-SC0008449.
• This work used resources of the Oak Ridge Leadership Computing Facility (OLCF), which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. This research also used resources of the Argonne Leadership Computing Facility (ALCF), which is a DOE Office of Science User Facility supported under contract DE-AC02-06CH11357. This research also used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 . Awards of computer time at ALCF and OLCF were provided by the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program.
5 Presentation_name
Background Science • Edge simulation (near outer wall) is critical for
tokamak-based fusion reactor design because the edge plasma conditions determine (i) core plasma quality, thus the fusion efficiency, and (ii) wall deterioration, thus the reactor lifetime
• Accurate simulation is difficult and requires effective exploitation of leadership computing systems: – Edge plasma is in contact with wall and density
has a very steep gradient, invalidating fluid approx. and requiring treatment of non-equilibrium physics
– Complicated geometry, currently addressed using unstructured triangular mesh
– Multiscale physics: background physics, microturbulence, neutrals and atomic physics
⇒ Unlike for simulations of just the core plasma, the scale-separated “delta-f” perturbation method is invalid.
ITER fusion reactor
6 Presentation_name
Target Application: XGC1 • Full-function (as opposed to perturbative delta-f)
gyrokinetic particle-in-cell code designed for simulating edge plasmas in tokamaks
• Solves 5D gyrokinetic equations via – ordinary differential equations for time
advance of particles in unstructured triangular physical space grid
– finite difference discretization of partial integro-differential Fokker-Planck collision operator on rectangular velocity-space grid
– Maxwell’s (partial differential) equations on unstructured triangular physical space grid, solved using PETSc
• The whole volume is simulated with a coarse mesh in the core to capture the large scale turbulence interaction between core and edge
Large scale turbulence in the whole volume simulation in DIII-D geometry
7 Presentation_name
XGC1: Key Features • diverted magnetic geometry
• gyrokinetic ions
• drift-kinetic electrons: small gyroradius limit (new as of 2013)
• Monte-Carlo neutral particle with wall-recycling coefficients
• multi-species impurity particles
• plasma heating in the core
• torque input in the core
• nonlinear Fokker-Planck-Landau collision operator (new as of 2014)
• logical Debye-sheath: code determines wall sheath from ambipolar loss constraint
• reads in an experimental geometry and plasma data
8 Presentation_name
XGC1: Program Flow I. Initialization
II. Timestep Loop: 1. For each step of a Runge-Kutta algorithm:
a. Collect particle charge density on underlying mesh. b. Solve gyrokinetc Poisson equation on mesh. c. Compute electric field and any derivative needed in particle equations of motion. d. Calculate and output diagnostic quantities. e. Update particle positions and velocities:
i. For electrons, subcycle (typically at least 60 times) with a fixed electric field (electron push).
ii. For ions, advance one step (ion push). f. Move particles between processes, as required by updated positions (particle shift).
2. Based on runtime parameter, N, every Nth timestep: a. Calculate particle collisions and apply this and other source terms to adjust fields.
III. Finalization
9 Presentation_name
XGC1: Prior Art XGC1, equipped with electrostatic turbulence capability to begin with, was further developed during the 2005-2012 DOE project Center for Plasma Edge Simulation (CPES):
• Lagrangian particle-in-cell code in 5D gyrokinetic phase space (3D configuration + 2D velocity)
• Solved gyrokinetic Vlasov equation with particle-momentum-energy conserving linear and nonlinear Coulomb collisions
• Used realistic geometry and boundary condition – World’s only kinetic code to include magnetic separatrix and material wall
• Utilized a number of sophisticated computational methods and tools, including PETSc, geometric hashing, Hilbert space filling curve, hybrid MPI-OpenMP parallelization, bicubic spline, etc,
• New computer science technologies were also developed, in particular the Adios adaptable IO system, DataSpaces data-staging substrate, and the eSiMon dashboard, all included in the End-to-end Framework for Fusion Integrated Simulation (EFFIS).
10 Presentation_name
XGC1: Pre-2013 This pre-2013 version of XGC1 scaled efficiently up to the full Cray XT5 at the Oak Ridge Leadership Computing Facility at the time (up to 223,488 cores), and routinely used over 70% of the system for production runs.
0
5
10
15
20
25
30
512 1024 2048 4096 8192 16384
Ave
rage
Sec
onds
per
Tim
este
p
Compute Nodes
XGC1 Performance (2010): Weak Particle Scaling
Cray XT5 (12 cores per node) ITER mesh 10.8M ions per node 3.6M ions per nodeDIII-D mesh 1.6M ions per node
128
256
512
1024
2048
4096
8192
512 1024 2048 4096 8192 16384
Mill
ion
Part
icle
s / s
econ
d
Compute Nodes
XGC1 performance on Cray XT5, Full-f simulation
Cray XT5 (12 cores per node) ITER mesh 10.8M ions per node 3.6M ions per nodeDIII-D mesh 1.6M ions per nodeExample linear scaling
11 Presentation_name
XGC1: 2013 • The introduction of drift-kinetic electrons changed the performance characteristics of XGC1
simulations significantly. The electron mass is much smaller than the ion mass, and the timestep for each electron push (calculating new particle positions) needs to be smaller. – XGC1 uses an electron sub-cycling method and ~60 electron pushes per ion push. The actual cost of the electron push is even larger than this implies, reaching more than 90% of the computation time in CPU-only runs (and an even higher percentage of flops). Table below is data from a 10 timestep run for a production DIII-D mesh on 32 XK7 nodes, without and with drift-kinetic electrons. Ion/electron flop ratio is representative across scales. On 16384 nodes (same particle count per node), ‘Main Loop’/’Electron Push’ flops ratio is 0.98, so slightly more than here.
Seconds Flop Count
(Max over procs.) (Total over procs.)
Main Loop 54 1246 8.1e+12 3.9e+14
Ion Push 13 13 4.0e+12 4.0e+12
Electron Push - 1142 - 3.8e+14
12 Presentation_name
XGC1: 2013 Performance • The drift-electron push was an obvious target for acceleration using the GPGPU available
on the Cray XK7 architecture. This was implemented using PGI CUDA Fortran, and achieved approximately 4X speed-up. This did require further optimizations to the associated communication algorithms to preserve good scalability for the whole code.
0
20
40
60
80
100
120
0 2 4 6 8 10 12 14 16
Ave
rage
Sec
onds
in P
USH
E pe
r XG
C1
Tim
este
p
Threads per MPI Task
XGC1 PUSHE Performance (16 nodes, 3.2 million ions and 3.2 million electrons per node)
Cray XK7, 16 nodes3.2M electrons and 3.2M ions per node CPU-only GPU-only 74% GPU, 26% CPU
0
50
100
150
200
250
300
0 2000 4000 6000 8000 10000 12000 14000 16000
Ave
rage
Sec
onds
per
XG
C1
Tim
este
p
Compute Nodes
XGC1 Performance: Weak Particle Scaling on DIII-D mesh (10 timesteps, 3.2 million ions and 3.2 million electrons per node)
Cray XK7 (1 16-core proc., 1 GPU per node)Jan. 2013 16 MPI tasks per node 1 MPI task, 16 threads per nodeJuly 2013 16 MPI tasks per node 1 MPI task, 16 threads per node 1 MPI task, 16 threads, 1 GPU per node
13 Presentation_name
XGC1: 2013 Performance • XGC1 also ran and scaled well on other systems.
– Mira: IBM BG/Q (“16”-core; 4-way HT per core; 1.6 GHz PowerPC) – Titan: Cray XK7 (8 FPUs, 16 integer cores; 2.2 GHz AMD; NVIDIA K20X GPU) – Piz Daint: Cray XC30 (8 cores, 2-way HT per core; 2.6 GHz Intel “Ivy Bridge”; NVIDIA K20X GPU) – Edison: Cray XGC30 (24 cores, 2-way HT per core; 2.4 GHz Intel “Ivy Bridge”)
0
50
100
150
200
64 256 1024 4096 16384
Ave
rage
Sec
onds
per
XG
C1
Tim
este
p
Compute Nodes
XGC1 Performance: Weak Particle Scaling on DIII-D mesh
DIII-D mesh IBM BG/Q 3.2M elec., 3.2M ions per node Cray XK7 3.2M elec., 3.2M ions per node Cray XC30 4.8M elec., 4.8M ions per node 3.2M elec., 3.2M ions per node
0
500
1000
1500
2000
2500
3000
0 5000 10000 15000 20000 25000 30000 35000
XGC
1 Ti
mes
teps
per
Day
Compute Nodes
XGC1 Performance: Strong Scaling on DIII-D mesh
DIII-D mesh: 32 planes, 56,980 vertices per plane25B electrons and 25B ionsEdison: 5,576 node Cray XC30 (24 cores, 24 FPUs per node)Piz Daint: 5272 node Cray XC30 (8 cores, 8 FPUs, 1 GPU per node)Titan: 18,688 node Cray XK7 (16 cores, 8 FPUs, 1 GPU per node)Mira: 48,152 node IBM BG/Q (16 cores, 16 FPUs per node)
14 Presentation_name
XGC1: 2014 • The nonlinear Fokker-Planck-Landau collision operator (potentially) changes the
performance characteristics of XGC1 simulations significantly (again). – The cost of the collision operator scales with the mesh size, not the particle count, and
is also a function of how often it is called (every timestep, every third timestep, …) and whether axisymmetry (between the mesh planes) is assumed.
– For 5 XGC1 timesteps of a representative science configuration (DIII-D mesh, 32 planes, 26B electrons, 8192 nodes on Titan), computing collisions every timestep, and not exploiting axisymmetry:
– Collision flop count would be divided by 32 if assuming axisymmetry, and by 3 if, e.g.,
computing every third timestep. However, electron push is implemented on the GPGPU on the XK7, and achieves a higher flop rate than Collision running on the CPU.
– Collision is also a prime candidate for porting to the GPGPU, and work on using OpenACC for this is in progress. Currently OpenMP is used effectively for thread level paralllelization.
Flop Count Main Loop 14.89e+15
Electron Push 5.37e+15
Collision 7.49e+15
15 Presentation_name
Hybrid Load Balancing • As ions and electrons are “pushed”, load imbalances will appear for any fixed mesh
decomposition. The particle velocity is such that load balancing in the toroidal direction is impractical, but also of little consequence in current simulations. The movement of particles in the radial or poloidal directions is much slower, and updating simple 1-D decompositions of a space-filling-curve ordering of the mesh points has proven very effective for load balancing particles, and thus both particle-related computational costs and memory requirements.
• The collision cost associated with a mesh cell is not uniform across the mesh, depending on the temperature of the plasma in the cell. Because the temperature distribution evolves with the simulation, the corresponding cost will also vary. However, temperature will generally be higher in some parts of the domain than in others throughout the simulation, and does not typically change its distribution very quickly.
• The same mesh decompositions are used for both the particle pushing and for the collision operator calculation, and what is optimal for one is not optimal for the other.
• So, when neither particle push nor collision operator dominate the computational cost (as measured in wallclock time), a compromise mesh decomposition must be determined, and updated as both collision and particle cost evolve, otherwise performance will be lost.
16 Presentation_name
Collision Load Balance Diagnosis • Even if the collision cost was uniform across the mesh, the optimal decomposition for
particles would be inefficient.
• For this one example problem, even at the very beginning of the run, there is a significant discrepancy in the number of nodes (mesh vertices) assigned to each process (left curve). The pattern is more obvious when looking at the number of nodes assigned to the subset of processes representing each “interplane” in the grid decomposition (same process index in a plane).
0
20
40
60
80
100
120
140
160
0 500 1000 1500 2000
Num
ber o
f ver
tices
ass
igne
d to
pro
cess
Process
Initial vertex load imbalance ti195-nstx-with-col
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
0 10 20 30 40 50 60N
umbe
r of v
ertic
es a
ssig
ned
to p
roce
ss in
terp
lane
Interplane Id
Initial vertex load imbalance ti195-nstx-with-col
17 Presentation_name
Collision Load Balance Diagnosis
• The number of iterations to convergence summed over all nodes assigned to a process or to an “interplane” of processes, is even greater than that of the node distribution imbalance, and the maximum cost is NOT on the processes with the maximum number of nodes.
0
500
1000
1500
2000
2500
0 500 1000 1500 2000
Tota
l num
ber o
f col
lisio
n ite
ratio
ns
Process
Initial iteration load imbalance ti195_nstx_with_col
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
0 10 20 30 40 50 60To
tal n
umbe
r of c
ollis
ion
itera
tions
Interplane Id
Initial iteration load imbalance ti195_nstx_with_col
• But the collision cost is not uniform. One issue is that an iterative (Picard) algorithm is used to solve in each mesh cell, and the number of iterations required for convergence varies.
18 Presentation_name
Collision Load Balance Diagnosis • As mentioned earlier, there is a strong domain dependence, due to the standard
temperature profile of the plasma.
19 Presentation_name
Black Box – First Attempt • Diagnosis generated lots of pretty pictures, but did not lead to any physics-inspired
solutions. Started over.
• First attempt: Simultaneous optimization – Can measure collision cost per cell directly. Can determine 1-D decomposition to load
balance this. – Can measure particle count per cell directly. Can determine 1-D decomposition to load
balance this. – How to combine?
Was not successful in combining particle count and collision cost metrics into a single function to be minimized.
– Load balancing particle distribution does load balance particle-associated cost, but the actual function mapping count to cost is complicated.
– Measuring cost (walltime) directly as a function of mesh cell was also infeasible because of the large number of particle-loops and synchronization points that exacerbate the impact of the particle count load imbalances.
20 Presentation_name
Black Box – Second Attempt • Second attempt: Constrained optimization
– Add a new input parameter β that specifies the maximum permissible particle load imbalance. (See below.)
– Add a new input parameter to specify how often to load balance (in increments of collision timesteps).
– Measure number of particles per cell each ion timestep. – Measure collision cost per cell directly each ion timestep that collision is calculated. – When load balancing,
• Determine 1-D decomposition to load balance particle count. Record associated worst case particle load imbalance ratio (maximum over processes / average per process). Call it α. (Typically not possible to load balance perfectly.)
• Determine 1-D decomposition to load balance collision cost as well as possible subject to a maximum particle load imbalance of α*β.
– Also update load balance whenever particle load imbalance exceeds α*β*γ (where γ is another already existing parameter, and this logic is already in the code).
This works, but is somewhat unsatisfactory in that choosing β is not obvious. It also does not adapt to changes in the distribution of the collision operator.
21 Presentation_name
Black Box – Third and Current Attempts • Third attempt: Two level scheme
– Record elapsed walltime from one collision timestep to the next, and refer to this as γ. – Use Golden Section Search to modify β each time load balancing algorithm is invoked,
looking to minimize γ as a function of β. (Note that the input parameter is now the initial value used.)
This does not address the issue of the collision cost distribution varying with the simulation, so walltime associated with a given value of β may not be accurate later in the simulation. Performance is also variable due to external factors (interconnect contention, etc.), and so the optimization algorithm needs to be robust with respect to perturbations.
• Fourth attempt: Modified Golden Section Search – Golden Section Search keeps history (ζ) corresponding to 3 β values. If any of these
values is not replaced after three updates, reuse this β value and force an update to the associated ζ wallclock time.
– Add new parameter to specify the largest permissible β value.
Experience usually provides a good estimate of the appropriate search interval, especially if a job is being restarted from a checkpoint. A tight upper bound on β accelerates convergence.
22 Presentation_name
Golden Section Search Cartoon
From http://ops.fhwa.dot.gov/trafficanalysistools/tat_vol3/sectapp_d.htm
23 Presentation_name
Black Box Results • Example optimal decompositions
– Uniform mesh decomposition – Particle-only load balanced decomposition – Hybrid particle count, collision cost decomposition
0
5000
10000
15000
20000
25000
30000
35000
0 50 100 150 200 250
Num
ber o
f mes
h ve
rtic
es s
umm
ed a
cros
s pl
anes
Interplane Id
XGC1 Performance: Load Balanced Graph Partitions
DIII-D mesh: 32 planes, 105,484 vertices per plane 10B electrons and 10B ions Titan: Cray XK7 (16-core CPU, 1 GPU per node) 4096 nodes, 8192 processes, 8-way threading
Mesh Vertex Count Load BalanceParticle Count Load Balance
Hybrid Collision Cost-Particle Count Load Balance
0
0.0005
0.001
0.0015
0.002
0.0025
0.003
0.0035
0.004
0.0045
0 50 100 150 200 250
Rel
ativ
e pa
rtic
le c
ount
sum
med
acr
oss
plan
es
Interplane Id
XGC1 Performance: Load Balancing Both Collision Cost and Particles
DIII-D mesh: 32 planes, 105,484 vertices per plane 10B electrons and 10B ions Titan: Cray XK7 (16-core CPU, 1 GPU per node) 4096 nodes, 8192 processes, 8-way threading
Particle-Only Load BalancingHybrid Collision-Particle Load Balancing
24 Presentation_name
Black Box Results • Example performance impacts
– Left: impact on collision load imbalance if only addressing particle load imbalance – Right: Here an initial run is followed by restart run, where the number of outer threads
used in the collision operator is increased from 1 to 8. The initial run starts with particle-only load balancing. The restart inherits the decomposition from the previous run, which is inefficient here, but the hybrid load balancing algorithm quickly doubles performance in both cases.
0
0.002
0.004
0.006
0.008
0.01
0.012
0 50 100 150 200 250
Rel
ativ
e co
llisi
on ru
ntim
e su
mm
ed a
cros
s pl
anes
Interplane Id
XGC1 Performance: Load Balancing Both Collision Cost and Particles
DIII-D mesh: 32 planes, 105,484 vertices per plane 10B electrons and 10B ions Titan: Cray XK7 (16-core CPU, 1 GPU per node) 4096 nodes, 8192 processes, 8-way threading
Particle-Only Load BalancingHybrid Collision-Particle Load Balancing
0
500
1000
1500
2000
2500
0 20 40 60 80 100 120 140
Seco
nds
per t
imes
tep
Timestep
Performance of XGC1 ITER simulation
Titan: Cray XK7 (16-core CPU, 1 GPU per node) 16384 nodes, 16384 processes, 16-way threading ITER mesh: 1M vertices per plane, 32 planes 131 billion each electrons and ions
1 outer thread, 16 inner threads8 outer threads, 2 inner threads
25 Presentation_name
Black Box Results: 2nd Example Back Story: Nested OpenMP parallelism is implemented in the collision operator for performance portability and to expose additional parallelism for future processor architectures. Outer OpenMP is usually a little more efficient, but requires more memory, and so a mixed strategy works best on the IBM BG/Q. On Edison (Cray XC30), both inner and outer are equally good. For the Titan runs, we started with an Edison-like specification (all 16 inner threads) but neglected to specify
export OMP_MAX_ACTIVE_LEVELS=2
In consequence, the job began without any OpenMP threading in the collision operator. We noticed the problem, and adjusted the settings to 8 outer threads and 2 inner threads (still not realizing the source of the problem), and this improved performance significantly. It also changed the collision cost to particle cost ratio, and thus changed the optimal load balance decomposition.
Details: – Collision operator computed every ion timestep – Hybrid load balancing scheme applied every second collision operator timestep (so
every second ion timestep)
26 Presentation_name
Black Box Results: 2nd Example • Left: Fraction of total runtime represented by collision operator. Using 8 OpenMP threads
almost decreases this by half. Note that hybrid load balancing also decreases this metric.
• Right: Impact of Golden Section Search on particle load imbalance constraint. Note that Golden Section Search History is lost when restarting, so both runs have the same first three values. After that, the higher collision cost (blue) leads to a looser particle load imbalance constraint when optimizing runtime.
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 10 20 30 40 50 60 70 80 90
Col
lisio
n C
ost F
ract
ion
of T
otal
Timestep
Fraction Collision Cost of Total Cost
Titan: Cray XK7 (16-core CPU, 1 GPU per node) ITER mesh (1M vertices), 131B particles (electrons, ions)
16384 nodes, 16384 processes, 16-way threading 16 inner collision threads
8 outer/2 inner collision threads
1
1.05
1.1
1.15
1.2
1.25
1.3
0 10 20 30 40 50 60 70 80 90
Part
icle
Loa
d Im
bala
nce
Con
stra
int
Timestep
Constraint on Collision Cost Load Balance Optimization
Titan: Cray XK7 (16-core CPU, 1 GPU per node) ITER mesh (1M vertices), 131B particles (electrons, ions)
16384 nodes, 16384 processes, 16-way threading 16 inner collision threads
8 outer/2 inner collision threads
27 Presentation_name
Black Box Results: 2nd Example • Left: Impact of hybrid load balancing on collision cost load imbalance. Both initial and
restart runs start off with very large load imbalances, which are quickly addressed.
• Right: Impact of hybrid load balancing on electron particle distribution load imbalance. Particle load imbalance seems to be less variable than collision cost (particle load imbalance does not change much between hybrid load balancing algorithm adjustments).
1
1.5
2
2.5
3
3.5
0 10 20 30 40 50 60 70 80 90
Load
Imba
lanc
e (M
ax/A
vg)
Timestep
Collision Load Imbalance
Titan: Cray XK7 (16-core CPU, 1 GPU per node) ITER mesh (1M vertices), 131B particles (electrons, ions)
16384 nodes, 16384 processes, 16-way threading 16 inner collision threads
8 outer/2 inner collision threads
1
1.05
1.1
1.15
1.2
1.25
1.3
1.35
0 10 20 30 40 50 60 70 80 90
Load
Imba
lanc
e (M
ax/A
vg)
Timestep
Electron Particle Load Imbalance
Titan: Cray XK7 (16-core CPU, 1 GPU per node) ITER mesh (1M vertices), 131B particles (electrons, ions)
16384 nodes, 16384 processes, 16-way threading 16 inner collision threads
8 outer/2 inner collision threads
28 Presentation_name
Summary • New science capabilities are continuing to increase cost and change performance
characteristics of the XGC1 code, and characteristics are very sensitive to choice of the many options.
• Recent addition of new nonlinear collision operator required re-engineering load balance scheme to address both grid-based and particle-based load imbalances simultaneous, to respond to evolutionary changes, and to deal robustly with performance outliers (whether due to simulation or to externally caused performance variability).
• Current algorithm has more than doubled performance for example production runs.
• Future work will investigate automatic determination of appropriate load balancing frequency, and elimination of need to set upper bounds.