E3C: Exploring Energy Efficient Computing


Dawn Geatches, Science & Technology Facilities Council, Daresbury Laboratory,
Warrington, WA4 4AD. [email protected]
This scoping project was funded under the Environmental Sustainability Concept Fund
(ESCF) within the Business Innovation Department of STFC.
This document is a first attempt to demonstrate how users of the quantum mechanics-
based software code CASTEP1 can run their simulations on high performance computing
(HPC) architectures efficiently. Whatever the level of experience a user might have, the
climate crisis we are facing dictates that we need to (i) become aware of the consumption of
computational resources of our simulations; (ii) understand how we, as users can reduce this
consumption; (iii) actively develop energy efficient computing habits. This document provides
some small insight to help users progress through stages (i) and (ii), empowering them to
adopt stage (iii) with confidence.
This document is not a guide to setting up and running simulations using CASTEP; such guides already exist (see, for example, CASTEP). Assumptions are made throughout this document that the user has a basic familiarity with the software and its terminology. This document does not exhaust all of the possible ways to reduce computational cost – much will be left to the user to discover for themselves and to share with the wider CASTEP community (e.g. via the JISCMAIL CASTEP Users Mailing List). Thank you!
1. Computational cost of simulations
2. Reducing the energy used by your simulation
A. Cell file
B. Param file
C. Submission script
4. What else can a user do?
5. What are the developers doing?
1. Computational cost of simulations
‘Computational cost’ in the context of this project is synonymous with ‘energy used’. As a
user of high performance computing (HPC) resources have you ever wondered what effect
your simulations have on the environment through the energy they consume? You might be
working on some great new renewable energy material and running hundreds or thousands
of simulations over the lifetime of the research. How does the energy consumed by the
research stack up against the energy that will be generated/saved/stored etc. by the new
material? Hopefully, the balance is gigantically in favour of the new material and its
promised benefits.
Fortunately, we can do more than hope that this is the case: we can actively reduce the
energy consumed by our simulations. Indeed, it's the responsibility of every single
computational modeller to do exactly that. Wouldn't it be great (not to say impressive) if,
when you write your next funding application, you can give a ballpark figure as to the amount
of energy your computational research will consume over the lifetime of the project?
As a user you might be thinking ‘but what effect can I have when surely the HPC architecture
is responsible for energy usage?’ and ‘then there’s the code itself, which should be as
efficient as possible but if it’s not I can’t do anything about that?’ Both of these thoughts are
grounded in truth: the HPC architecture is fixed - but we can use it efficiently; the software
we’re using is structurally fixed – but we can run it efficiently.
The energy cost (E) of a simulation is the total power per core (P) consumed over the length
of time (T) of the simulation, which for parallelised simulations run on (N) cores is:
E = N × P × T.
From this it is logical to think that reducing N, P, and/or T will reduce E, which is theoretically
true. Practically though, let's assume that the power consumed by each core is a fixed
property of the HPC architecture; we then have E ∝ N × T. This effectively encapsulates where
we, as users of HPC, can control the amount of energy our simulations consume, and seems
simple. All we need to do is learn how to optimize the number of cores and the length of time
of our simulations.
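Under the E ∝ N × T assumption, "core-seconds" (the quantity reported in the "relative total energy" rows of the tables later in this document) serves as a simple relative-energy metric. A minimal sketch, with hypothetical run times rather than CASTEP output:

```python
def relative_energy(n_cores: int, total_time_s: float) -> float:
    """Relative energy in core-seconds, assuming a fixed power draw per core (E ∝ N*T)."""
    return n_cores * total_time_s

# Hypothetical timings for the same job run two ways:
serial = relative_energy(1, 1000.0)    # 1000.0 core-seconds
parallel = relative_energy(4, 300.0)   # 1200.0 core-seconds
# The 4-core run finishes ~3.3x sooner yet consumes ~20% more energy.
```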
We use multiple cores to share the memory load and to speed-up a calculation, giving us
three calculation properties to optimise: number of cores; memory per core; time. To reduce
the calculation time we might first increase the number of cores. Many users might already
know that the relationship between core count and calculation time is non-linear thanks to
the required increase in core-to-core and node-to-node communication time. Taking the
latter into account means the total energy used is E = N × P × T + C(N, T), where C(N, T)
captures the energy cost of the core-core/node-node communication time.
To optimise energy efficiency, any speed-up in calculation time gained by increasing the
number of cores needs to balance the increased energy cost of using additional cores.
Therefore, the speed-up factor needs to exceed the factor by which the number of cores
increases, as shown in the equations below for a 2-core vs serial example.

E1 = P × T1, (C(1, T1) = 0)      Energy of serial (i.e. 1-core) calculation
E2 = 2 × P × T2 + C(2, T2)       Energy of 2-core calculation
E2 ≤ E1                          For the energy cost of using 2 cores to be no greater
                                 than the energy cost of the serial calculation
2 × P × T2 + C(2, T2) ≤ P × T1
i.e. T2 + C(2, T2) / (2 × P) ≤ T1 / 2

which means that the total calculation time using 2 cores needs to be less than half of the
serial time. So, for users to run simulations efficiently in parallel, they need to balance the
number of cores, the associated memory load per core, and the total calculation time. The
following section shows how some of the more commonly used parameters within CASTEP
affect these three properties.
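The break-even condition above can be expressed as a short check (a sketch, with the power per core normalised to 1 and hypothetical timings):

```python
def parallel_saves_energy(t_serial: float, t_parallel: float, n_cores: int,
                          comm_energy: float = 0.0, p_core: float = 1.0) -> bool:
    """True if E_N = N*P*T_N + C(N, T_N) <= E_1 = P*T_1, per the derivation above."""
    return n_cores * p_core * t_parallel + comm_energy <= p_core * t_serial

# With zero communication cost, a 2-core run must finish in at most half the serial time:
assert parallel_saves_energy(100.0, 50.0, 2)
assert not parallel_saves_energy(100.0, 60.0, 2)
# Any communication overhead pushes the required 2-core time below T1/2:
assert not parallel_saves_energy(100.0, 50.0, 2, comm_energy=5.0)
```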
NB: The main purpose of the following examples is to illustrate the impact of different
user-choices on the total energy cost of simulations. These examples do not indicate
the level of ‘accuracy’ attained because ‘accuracy’ is determined by the user
according to the type, contents, and aims of their simulations.
2. Reducing the energy used by your simulation
This section uses an example of a small model of a clay mineral, (and later a carbon
nanotube) to illustrate how a user can change the total energy their simulation uses by a
judicious choice of CASTEP input parameters.
Figure 1 unit cell of generic silicate clay mineral comprising 41 atoms
A. Cell file
Choose the pseudopotential according to the type of simulation, e.g. for simulations of cell
structures ultrasofts2 are often sufficient, although if the pseudopotential library does not
contain an ultrasoft version for a particular element, the on-the-fly-generated (OTFG)
ultrasofts3 might suffice. If a user is running a spectroscopic simulation such as infrared
using density functional perturbation theory4 then norm-conserving5 or OTFG norm-
conserving3 could be the better choices. The impact of pseudopotential type on the
computational cost is shown in Table 1 through the total (calculation) time.
Type of pseudopotential
Ultrasoft Norm- conserving
# coresa 5 5 5 5 5
Memory/process (MB)
Peak memory use (MB)
Total time (secs) 55 89 250 109 136
Table 1 Pseudopotential and size of planewave set required on the ‘fine’ setting of Materials Studio 20206, and an example of the memory requirements and time required for a single point energy calculation using the recorded number of cores on a single node. Unless otherwise stated, the same cut-off energy per type of pseudopotential is implied throughout this document. aUsing Sunbird (CPU: 2x Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz with 20 cores each); unless stated otherwise all calculations were performed on this HPC cluster. bDesigned to be used at the same modest (340 eV) kinetic energy cut-off across the periodic table. They are ideal for moderate-accuracy, high-throughput calculations, e.g. ab initio random structure searching (AIRSS).
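In the .cell file, pseudopotentials are selected via the SPECIES_POT block. A sketch only: the potential filenames below are hypothetical, and library names (e.g. the ‘NCP’ OTFG norm-conserving set) should be checked against your CASTEP version and pseudopotential library.

```
! Per-element potential files (filenames hypothetical):
%block SPECIES_POT
O   O_00.usp
Si  Si_00.usp
%endblock SPECIES_POT

! Or select an on-the-fly-generated (OTFG) library for all species:
! %block SPECIES_POT
! NCP
! %endblock SPECIES_POT
```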
Changing the number of Brillouin zone sampling points can have a dramatic effect on
computational time as shown in Table 2. Bear in mind that increasing the number of k-points
increases the memory requirements, often tempting users to increase the number of cores,
further increasing overall computational cost. Remember though, it's important to use the
number of k-points that provides the level of accuracy your simulations need.
Type of pseudopotential
OTFG Norm-conserving
kpoints_mp_grid 2 1 1 (1) 3 2 1 (3) 4 3 2 (12) 2 1 1 (1) 3 2 1 (3) 4 3 2 (12)
Memory/process (MB)
Peak memory use (MB)
Total time (secs) 32 55 222 85 136 477
Table 2 Single point energy calculations run on 5 cores using different numbers of k-points (in brackets), showing the
effects for different pseudopotentials.
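The Monkhorst–Pack grids compared in Table 2 are requested in the .cell file with a single keyword, using the spelling that appears in the table above:

```
kpoints_mp_grid 3 2 1
```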
Vacuum space
When building a material surface it is necessary to add vacuum space to a cell (see Figure 2
for example), and this adds to the memory requirements and calculation time because the
‘empty space’ (as well as the atoms) is ‘filled’ by planewaves. Table 3 shows that doubling
the volume of vacuum space doubles the total calculation time (using the same number of
cores).
Overall parallel efficiencya 69% 66% 67% 61%
Figure 2 Vacuum space added to create clay mineral surface (to study adsorbate-surface interactions for example –adsorbate not included in the above)
Table 3 Single point energy calculations using ultrasoft pseudopotentials and 3 k-points, run on 5 cores, showing the effects of vacuum space. aCalculated automatically by CASTEP.
Supercell size
The size of a system is one of the more obvious choices that affects the demands on
computational resources; nevertheless, it is interesting to see (from Table 4) that for the
same number of kpoints, doubling the number of atoms increases the memory load per
process by between 35% (41 to 82 atoms) and 72% (82 to 164 atoms), while the corresponding
calculation times increase by factors of 11 and 8.5 respectively. In good practice, the number
of kpoints is scaled according to the supercell size, increasing the computational cost further.
Supercell size (# atoms)     1 x 1 x 1 (41)  2 x 1 x 1 (82)  2 x 1 x 1 (82)  2 x 2 x 1 (164)  2 x 2 x 1 (164)
Kpoints (mp grid)            3 2 1 (3)       3 2 1 (3)       2 1 1 (1)       3 2 1 (3)        2 1 1 (1)
Peak memory use (MB)         777             1175            1025            2330             2177
Total time (secs)            55              631             329             5416             1660
Overall parallel efficiencya 69%             69%             74%             67%              72%
Table 4 Single point energy calculations using ultrasoft pseudo-potentials, run on 5 cores, showing the effects of supercells. aCalculated automatically by CASTEP.
Figure 3 Example of 2 x 2 x 1 supercell
Orientation of axes
This might be one of the more surprising and unexpected properties of a model that affects
computational efficiency. The effect becomes significant when a system is large,
disproportionately longer along one of its lengths, and is misaligned with the x-, y-, z-axes,
see Figure 4 and Table 5 for exaggerated examples of misalignment. This effect is due to
the way CASTEP transforms properties between real space and reciprocal space; it
decomposes the 3-d fast Fourier transforms (FFT) into three sets of 1-d FFTs along columns
that lie parallel to the x-, y-, z-axes.
Figure 4 Top row: A capped carbon nanotube (160 atoms), and bottom row a long carbon nanotube (1000 atoms) showing:
long axes aligned in the x-direction (left); z-direction (middle); skewed (right).
Orientation (# atoms)        X (160)  Z (160)  Skewed (160)  X (1000)  Z (1000)  Skewed (1000)
Memory/process (MB)          884      882      882           2870      2870      2870
Peak memory use (MB)         1893     1885     1838          7077      7077      7077
Total time (secs)            392      359      409           3906      3908      5232
Overall parallel efficiencya
Relative total energy
(# cores * total time,
core-seconds)                1960     1795     2045          234360    234480    313920
Table 5 Single point energy calculations of carbon nanotubes shown as oriented in Fig. 4, using ultrasoft pseudopotentials (280 eV cut-off energy) and 1 k-point. aCalculated automatically by CASTEP.
B. Param file
Although the ultrasofts require a smaller planewave basis set than the norm-conserving
pseudopotentials, they do need a finer electron density grid, set via ‘grid_scale’ and
‘fine_grid_scale’. As shown in Table 6, the denser grid setting for the OTFG ultrasofts
(with the exception of the QC5 set) can almost double the calculation time compared with the
larger, planewave-hungry OTFG norm-conserving pseudopotentials, which converge well with a
less dense grid.
Memory/process (MB)
Peak memory use (MB)
Total time (secs) 89 150 55 136 221 250 109
Table 6 Single point energy calculations run on 5 cores, showing the effects of different electron density grid settings.
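The grid settings discussed above are .param keywords; the values below are illustrative only (denser grids suit the ultrasofts, per the text, and should be converged for your own system):

```
grid_scale      2.0
fine_grid_scale 3.0
```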
Data Distribution
Parallelizing over plane wave vectors (‘G-vectors’), k-points or a mix of the two has an
impact on computational efficiency as shown in Table 7.
The default for a .param file without the keyword ‘data_distribution’ is to prioritize k-point
distribution across a number of cores (less than or equal to the number requested in the
submission script) that is a factor of the number of k-points, see for example, Table 7,
columns 2 and 3. Inserting ‘data_distribution : kpoint’ into the .param file prioritizes and
optimizes the k-point distribution across the number of cores requested in the script. In the
example tested, selecting data distribution over kpoints increased the calculation time over
the default of no data distribution; compare columns 3 and 5 of Table 7.
Requesting G-vector distribution has the largest impact on calculation time and combining
this with requesting a number of cores that is also a factor of the number of k-points, has the
overall largest impact on reducing calculation time –see columns 6 and 7 of Table 7.
Requesting mixed data distribution has a similar impact on calculation time as not requesting
any data distribution for 5 cores, but not for 6 cores: there, the ‘mixed’ distribution used
4-way kpoint distribution rather than the 6-way distribution applied by the default (no
request) – compare columns 2 and 3 with 8 and 9.
For the small clay model system the optimal efficiency was obtained using G-vector data
distribution over 6 cores (852 core-seconds) and the least efficient choice was mixed data
distribution over 6 cores (1584 core-seconds). The results are system-specific and need
careful testing to tailor to different systems.
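Following the syntax quoted above, the best-performing choice for this clay example would be requested in the .param file as (allowed values include kpoint, gvector and mixed):

```
data_distribution : gvector
```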
Number of tasks per node
This is invoked by adding ‘num_proc_in_smp’ to the .param file, and controls the number of message passing interface (MPI) tasks that are placed in a shared-memory (SMP) group. This means that the “all-to-all” communication is then done in three phases instead of one: (1) tasks within an SMP group collect their data together on a chosen “controller” task within their group; (2) the “all-to-all” is done between the controller tasks; (3) the controllers all distribute the data back to the tasks in their SMP groups.
For small core counts, the overhead of the two extra phases makes this method slower than just doing an all-to-all; for large core counts, the reduction in the all-to-all time more than compensates for the extra overhead, so it's faster. Indeed, the tests (shown in Table 8) reveal that invoking this flag fails to produce as large a speed-up as the flag ‘data_distribution : gvector’ (compare columns 3 and 9) on the test HPC cluster, Sunbird, reflecting the small core count requested. Generally speaking, the more cores in the G-vector group, the higher ‘num_proc_in_smp’ should be set (up to the physical number of cores on a node).
Column # 1                     2      3      4        5        6        7        8      9
Requested data distribution    None   None   Kpoints  Kpoints  Gvector  Gvector  Mixed  Mixed
# cores in submission script   5      6      5        6        5        6        5      6
Actual data distribution
Peak memory use (MB)           1581   1561   1581     1561     839      804      1581   1585
Total time (secs)              295    199    292      226      191      142      294    264
Overall parallel efficiencya
Relative total energy
(# cores * total time,
core-seconds)                  1475   1194   1460     1356     955      852      1470   1584
Table 7 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, showing the effects of data distribution across different numbers of cores requested in the script file. ‘Actual data distribution’ means that reported by CASTEP on completion in this and (where applicable) all following Tables. ‘Relative total energy’ assumes that each core requested by the script consumes X amount of electricity. aCalculated automatically by CASTEP.
Column # 1                    2             3              4             5              6             7              8             9
num_proc_in_smp               Default       Default        2             2              4             4              5             5
Requested data_distribution   None          Gvector        None          Gvector        None          Gvector        None          Gvector
Actual data distribution      kpoint 4-way  Gvector 5-way  kpoint 4-way  Gvector 5-way  kpoint 4-way  Gvector 5-way  kpoint 4-way  Gvector 5-way
Memory/process (MB)           1249          728            1249          728            1249          728            1249          728
Peak memory use (MB)          1580          837            1581          839            1581          844            1581          846
Total time (secs)             222           156            231           171            230           182            237           183
Overall parallel efficiencya  96%           66%            98%           60%            98%           56%            96%           56%
Table 8 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of setting ‘num_proc_in_smp : 2, 4, 5’, both with and without the ‘data_distribution : gvector’ flag. ‘Default’ means ‘num_proc_in_smp’ absent from .param file. aCalculated automatically by CASTEP
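A group size of 4, as tested in Table 8, would be requested in the .param file as follows (the value is illustrative and should be tuned to the core count and node size, as discussed above):

```
num_proc_in_smp : 4
```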
Optimization strategy
This parameter has three settings and is invoked through the ‘opt_strategy’ flag in the
.param file:
Default - Balances speed and memory use. Wavefunction coefficients for all k-points
in a calculation will be kept in memory, rather than be paged to disk. Some large
work arrays will be paged to disk.
Memory - Minimizes memory use. All wavefunctions and large work arrays are paged
to disk.
Speed - Maximizes speed by not paging to disk.
This means that if a user runs a large memory calculation, optimizing for memory could
obviate the need to request additional cores although the calculation will take longer - see
Table 9 for comparisons.
opt_strategy                  Default  Memory  Speed
Peak memory use (MB)
Overall parallel efficiencya  94%      97%     96%
Table 9 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of optimizing for speed or memory. ‘Default’ means either omitting the ‘opt_strategy’ flag from the .param file or adding it as ‘opt_strategy : default’. aCalculated automatically by CASTEP.
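For a memory-hungry job, the trade described above would be requested in the .param file as:

```
opt_strategy : memory
```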
Spin polarization
If a system comprises an odd number of electrons it might be important to differentiate
between the spin-up and spin-down states of the odd electron. This directly affects the
calculation time, effectively doubling it as shown in Table 10.
.param flag and setting
Overall parallel efficiencya 96% 98%
Table 10 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of spin polarization. aCalculated automatically by CASTEP.
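Spin polarization is switched on in the .param file; the keyword spelling below follows CASTEP's documented form and should be checked against your version:

```
spin_polarized : true
```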
Electronic energy minimizer
Insulating systems often behave well during the self-consistent field (SCF) minimizations and
converge smoothly using density mixing (‘DM’). When SCF convergence is problematic and
all attempts to tweak DM-related parameters have failed, it is necessary to turn to ensemble
density functional theory7 and accept the consequent (and considerable) increase in
computational cost –see Table 11.
.param flag and setting
metals_method (Electron minimization)  DM    EDFT
Memory/process (MB)                    1249  1289
Peak memory use (MB)                   1581  1650
Total time (secs)                      222   370
Overall parallel efficiencya           96%   97%
Table 11 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of the electronic minimization method. ‘DM’ means density mixing and ‘EDFT’ ensemble density functional theory.
aCalculated automatically by CASTEP.
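Switching the electronic minimizer from density mixing to ensemble DFT uses the flag named in Table 11:

```
metals_method : edft
```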
C. Submission script
Figure 5 An example HPC batch submission script
Figure 5 captures the script variables that affect HPC computational energy usage:
(i) The variable familiar to most HPC users is the number of cores (‘tasks’)
requested for the simulation. Unless the calculation is memory hungry, configure
the requested number of cores to sit on the fewest nodes, because this reduces
expensive node-to-node communication time.
(ii) Choosing the shortest job run time gives the calculation a better chance of
progressing through the job queue swiftly.
(iii) When not requesting use of all cores on a single node, remove the ‘exclusive’
flag to accelerate progress through the job queue.
(iv) Using the most recent version of…
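Since the Figure 5 image does not reproduce here, the sketch below shows points (i)–(iii) for a SLURM scheduler. The directives, module name and core count are illustrative assumptions, not Sunbird's actual configuration; adapt them to your cluster.

```
#!/bin/bash
#SBATCH --ntasks=5          # (i) number of cores ('tasks') requested
#SBATCH --nodes=1           #     keep tasks on the fewest nodes possible
#SBATCH --time=00:30:00     # (ii) shortest realistic job run time
##SBATCH --exclusive        # (iii) left commented out: not using a whole node

module load CASTEP          # hypothetical module name; site-specific
mpirun -np "$SLURM_NTASKS" castep.mpi seedname
```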