
Pursuing Faster I/O in COSMO

POMPA Workshop May 3rd 2010


Recap – why I/O is such a problem


• The I/O problem:
  – I/O is the limiting factor for scaling of COSMO
    • Limiting factor for many data-intensive applications
  – The speed of I/O subsystems for writing data is not keeping up with increases in the speed of compute engines

Idealised 2D grid layout:

Increasing the number of processors by a factor of 4 leads to each processor having

• one quarter the number of grid points to compute
• one half the number of halo points to communicate

The same amount of total data needs to be output at each time step.

P processors, each with:
  M x N grid points
  2M + 2N halo points

4P processors, each with:
  (M/2) x (N/2) grid points
  M + N halo points
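
Spelling out the arithmetic behind the picture (the global grid size Gx x Gy and the number of levels n_lev are symbols introduced here for illustration; they are not on the slide): if each of P processes holds an M x N subdomain of a fixed global grid, then

\[
MN = \frac{G_x G_y}{P} \propto \frac{1}{P},
\qquad
2M + 2N = \frac{2\,(G_x + G_y)}{\sqrt{P}} \propto \frac{1}{\sqrt{P}},
\qquad
\text{output per step} = G_x\,G_y\,n_{\mathrm{lev}} \propto 1,
\]

so quadrupling P quarters the compute work per process, halves the halo, and leaves the output volume unchanged.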


I/O reaches a scaling limit


Computation: scales as O(P) for P processors
  Minor scaling problem – issues of halo memory bandwidth, vector lengths, efficiency of the software pipeline, etc.

Communication: scales as O(√P) for P processors
  Major scaling problem – the halo region decreases only slowly as the number of processors increases

I/O (mainly “O”): no scaling
  Limiting factor in scaling – the same amount of total data is output at each time step
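
A simple per-step cost model (illustrative only; the constants a, b, c for compute, communication and output cost are not taken from any measurement) makes the limit explicit:

\[
T(P) \;\approx\; a\,\frac{G_x G_y\,n_{\mathrm{lev}}}{P} \;+\; b\,\frac{G_x + G_y}{\sqrt{P}} \;+\; c\,G_x G_y\,n_{\mathrm{lev}} .
\]

As P grows the first two terms shrink, so T(P) flattens out at the constant output term: I/O sets the floor on the achievable time per step.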


Current I/O strategies in COSMO

• Two output formats – GRIB and NetCDF
  – GRIB is dominant in operational weather forecasting
  – NetCDF is the main format used in climate research

• GRIB output can use asynchronous I/O processes to improve parallel performance

• NetCDF is always ultimately serialised through process zero of the simulation

• In both the GRIB and NetCDF cases the output follows a multi-level data-collection approach


Multi-level approach


[Figure: a 3 x 6 grid of compute processes, each holding 3 atmospheric levels. Complete levels are first collected onto individual compute processes ("collect on atmospheric levels" – Proc 0, Proc 1, Proc 2), and the data are then sent to the I/O proc level by level, which writes them to storage.]
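
A minimal C/MPI sketch of the gather-on-levels stage (the routine name, the contiguous per-level storage, and the counts/displs arrays are assumptions for illustration, not the actual COSMO code):

    #include <stddef.h>
    #include <mpi.h>

    /* Illustrative sketch only.  Gather one horizontal level at a time onto
     * 'root', which then holds a complete level and can forward it (or write
     * it).  Assumes each rank stores its m x n patch of level k contiguously
     * at local[k*m*n]; counts/displs describe where each rank's patch lands
     * in the buffer assembled on the root.                                   */
    static void gather_levels(const double *local, int m, int n, int nlev,
                              double *level_buf,        /* significant on root only */
                              const int *counts, const int *displs,
                              int root, MPI_Comm comm)
    {
        for (int k = 0; k < nlev; ++k) {
            MPI_Gatherv(local + (size_t)k * m * n, m * n, MPI_DOUBLE,
                        level_buf, counts, displs, MPI_DOUBLE,
                        root, comm);
            /* On 'root': rearrange the per-rank patches into (i,j) order and
             * pass the assembled level on towards the I/O process.           */
        }
    }

Gathering level by level keeps the receive buffer down to a single horizontal field, at the price of one collective operation per level.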


Performance limitations and constraints

• Both the GRIB and NetCDF formats carry out the gather-on-levels stage

• For GRIB-based weather simulations the final collect-and-store stage can deploy multiple I/O processes to deal with the data
  – Allows improved performance where real storage bandwidth is the bottleneck
  – Produces multiple files (one per I/O process) that can easily be concatenated together

• With NetCDF, only process 0 can currently act as an I/O proc for the collect-and-store stage
  – Serialises the I/O through one compute process
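
In C, the NetCDF collect-and-store stage boils down to something like the sketch below (file, dimension and variable names are placeholders, and error checking is omitted): only rank 0 ever calls the NetCDF library, which is exactly the serialisation point.

    #include <stddef.h>
    #include <netcdf.h>

    /* Illustrative sketch, not the COSMO routine: rank 0 writes a gathered
     * (nlev x ny x nx) field with the serial NetCDF API while every other
     * rank waits -- the output is serialised through one compute process.  */
    void store_field_rank0(const double *full_field, size_t nx, size_t ny,
                           size_t nlev, int rank)
    {
        if (rank != 0) return;                 /* all other ranks do nothing */

        int ncid, dimids[3], varid;
        nc_create("cosmo_out.nc", NC_CLOBBER, &ncid);
        nc_def_dim(ncid, "lev", nlev, &dimids[0]);
        nc_def_dim(ncid, "y",   ny,   &dimids[1]);
        nc_def_dim(ncid, "x",   nx,   &dimids[2]);
        nc_def_var(ncid, "T", NC_DOUBLE, 3, dimids, &varid);
        nc_enddef(ncid);

        nc_put_var_double(ncid, varid, full_field);   /* one serial write */
        nc_close(ncid);
    }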


Possible strategies for fast NetCDF I/O

1. Use a version of parallel NetCDF to have all compute processes write to disk
   – Eliminates both the gather-on-levels and collect-and-store stages (sketched after this list)
2. Use a version of parallel NetCDF on the subset of compute processes that are needed for the gather stage
   – Eliminates the collect-and-store stage
3. Use a set of asynchronous I/O processes, as is currently done in the GRIB implementation
   – If more than one asynchronous process is employed this would require parallel NetCDF or post-processing
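
Strategy 1 could look roughly like the following C sketch using NetCDF-4's parallel interface (file name, variable name and the decomposition arguments are assumptions; the PnetCDF ncmpi_* interface would be an alternative route to the same effect): every compute process writes its own subdomain directly, so neither gather stage is needed.

    #include <mpi.h>
    #include <netcdf.h>
    #include <netcdf_par.h>

    /* Illustrative sketch: each rank writes its own (nlev x ny_loc x nx_loc)
     * block of the global (nlev x ny_glob x nx_glob) field.  y0/x0 are this
     * rank's offsets in the global grid, taken from the domain decomposition. */
    void parallel_store(const double *local, size_t nlev,
                        size_t ny_glob, size_t nx_glob,
                        size_t ny_loc,  size_t nx_loc,
                        size_t y0,      size_t x0,
                        MPI_Comm comm)
    {
        int ncid, dimids[3], varid;

        /* Collective create: every rank opens the same file. */
        nc_create_par("cosmo_out.nc", NC_NETCDF4 | NC_MPIIO,
                      comm, MPI_INFO_NULL, &ncid);
        nc_def_dim(ncid, "lev", nlev,    &dimids[0]);
        nc_def_dim(ncid, "y",   ny_glob, &dimids[1]);
        nc_def_dim(ncid, "x",   nx_glob, &dimids[2]);
        nc_def_var(ncid, "T", NC_DOUBLE, 3, dimids, &varid);
        nc_enddef(ncid);

        nc_var_par_access(ncid, varid, NC_COLLECTIVE);  /* collective MPI-IO writes */

        size_t start[3] = { 0, y0, x0 };
        size_t count[3] = { nlev, ny_loc, nx_loc };
        nc_put_vara_double(ncid, varid, start, count, local);

        nc_close(ncid);
    }

As the results in the next section show, the cost of file opens and other metadata can dominate this approach when the individual writes are small.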


Full parallel strategy

• A simple micro-benchmark of 3D data distributed on a 2D process grid showed reasonable results

• This was implemented in the RAPS code and tested with the IPCC benchmark at ~900 cores
  – No smoothing operations in this benchmark or in the code

• The results were poor
  – Much of the I/O in this benchmark is 2D fields
  – Not much data is written at each time step
  – The current I/O performance is not bad
  – The parallel strategy became dominated by metadata operations

• File writes for 3D fields were reasonably fast (~0.025 s for 50 MBytes)

• Opening the file took a long time (0.4 to 0.5 seconds) – see the rough estimate below

• The strategy may be useful for high-resolution simulations writing large 3D blocks of data
  – Originally this strategy was expected to target 2000 x 1000 x 60+ grids
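
The slide's own numbers show why metadata dominated (rounded arithmetic using the figures quoted above):

\[
\frac{50\ \text{MB}}{0.025\ \text{s}} \approx 2\ \text{GB/s effective write bandwidth},
\qquad
\frac{0.45\ \text{s (open)}}{0.025\ \text{s (write)}} \approx 18,
\]

so unless a single write moves on the order of a gigabyte, the 0.4 to 0.5 seconds spent opening the file, rather than the write itself, sets the cost of the output step.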


Slowdown from metadata

• The first strategy has problems related to metadata scalability

• Most modern high-performance file systems use POSIX I/O for open/close/seek, etc.

• This is not scalable, because file-access operations are then limited by the time taken for the underlying metadata operations


Non-scalable metadata – file open speeds

• Opening a file is not a scalable operation on modern parallel file systems
  – See the graph of the two CSCS file systems
• There are some mitigation strategies in MPI’s ROMIO/ADIO layer (a sketch of passing such hints follows below)
  – “Delayed open” only makes the POSIX open call when actually needed
    • For MPI-IO collective operations, only a subset of processes actually write the data
  – No mitigation strategies for specific file systems (Lustre and GPFS)
• With current file systems using POSIX I/O calls, full parallel I/O is not scalable
• We need to pursue the other strategies for COSMO, unless large blocks of data are being written


[Figure: time in seconds to open a file against the number of MPI processes involved in the file open, for the two CSCS file systems]
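
One concrete form the ROMIO/ADIO mitigation can take is passing hints at file-open time. The C sketch below is an illustration rather than anything the slides prescribe: cb_nodes, romio_cb_write and romio_no_indep_rw are standard ROMIO hints, but which values actually help depends on the MPI library and the underlying file system.

    #include <mpi.h>

    /* Sketch: open a shared file for collective writes with ROMIO hints that
     * (a) force writes through the collective-buffering path,
     * (b) limit the number of aggregator ("writer") processes, and
     * (c) allow ROMIO's deferred open, so non-aggregators need not call open(). */
    MPI_File open_for_collective_output(const char *path, MPI_Comm comm)
    {
        MPI_Info info;
        MPI_File fh;

        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write",    "enable");  /* always collective buffering */
        MPI_Info_set(info, "cb_nodes",          "8");       /* at most 8 aggregators write  */
        MPI_Info_set(info, "romio_no_indep_rw", "true");    /* permits deferred open        */

        MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }

Since nc_create_par (and PnetCDF's ncmpi_create) accept an MPI_Info argument, the same hints can also be passed through the parallel NetCDF sketch shown earlier.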


Next steps

• We are looking at all 3 strategies for improving NetCDF I/O
• We are investigating the current state of metadata accesses in the MPI-IO layer and in file systems in general
  – Particularly Lustre and GPFS, but also others (e.g. OrangeFS)

• … but for some jobs the individual I/O operations might not be large enough to allow much speedup
