

Environmental Modelling & Software 26 (2011) 1725-1735


Efficient data IO for a Parallel Global Cloud Resolving Model

Bruce Palmer a,*, Annette Koontz a, Karen Schuchardt a, Ross Heikes b, David Randall b

a Computational Sciences and Mathematics Division, Pacific Northwest National Laboratory, Richland, WA 99352, USA
b Department of Atmospheric Sciences, Colorado State University, Fort Collins, CO 80523, USA

Article info

Article history:
Received 7 June 2010
Received in revised form 11 August 2011
Accepted 17 August 2011
Available online 26 October 2011

Keywords:
High performance IO
Parallel IO libraries
Data formatting
Geodesic grid
Global Cloud Resolving Model
Grid Specifications

* Corresponding author. Tel.: +1 509 375 3899.
1364-8152/$ - see front matter Published by Elsevier Ltd.
doi:10.1016/j.envsoft.2011.08.007

Abstract

Execution of a Global Cloud Resolving Model (GCRM) at target resolutions of 2-4 km will generate, at a minimum, 10s of Gigabytes of data per variable per snapshot. Writing this data to disk, without creating a serious bottleneck in the execution of the GCRM code, while also supporting efficient post-execution data analysis is a significant challenge. This paper discusses an Input/Output (IO) application programmer interface (API) for the GCRM that efficiently moves data from the model to disk while maintaining support for community standard formats, avoiding the creation of very large numbers of files, and supporting efficient analysis. Several aspects of the API will be discussed in detail. First, we discuss the output data layout, which linearizes the data in a consistent way that is independent of the number of processors used to run the simulation and provides a convenient format for subsequent analyses of the data. Second, we discuss the flexible API interface that enables modelers to easily add variables to the output stream by specifying where in the GCRM code these variables are located and to flexibly configure the choice of outputs and distribution of data across files. The flexibility of the API is designed to allow model developers to add new data fields to the output as the model develops and new physics is added. It also provides a mechanism for allowing users of the GCRM code to adjust the output frequency and the number of fields written depending on the needs of individual calculations. Third, we describe the mapping to the NetCDF data model with an emphasis on the grid description. Fourth, we describe our messaging algorithms and IO aggregation strategies that are used to achieve high bandwidth while simultaneously writing concurrently from many processors to shared files. We conclude with initial performance results.

Published by Elsevier Ltd.

1. Introduction

The push to create more reliable and accurate simulations for environmental modeling has led to an increasing reliance on parallel programming to run larger and more detailed simulations in a timely manner. This is particularly true for simulations of climate change, but parallel programming, particularly programs that scale to large numbers of processors, is becoming increasingly important in other areas of environmental modeling as well. Individual environmental components are being run at larger scales and components are being coupled together to create larger models of environmental systems. Examples of the use of parallel codes in environmental simulation include hydrology (Hammond and Lichtner, 2010), surface water modeling (Von Bloh et al., 2010; Neal et al., 2010; Yu, 2010), and simulations of the ocean (Maltrud and McClean, 2005).


Simulations of climate in general and the atmosphere in particular have a long history of using parallel computation to increase the complexity of the models simulated and to extend the resolution and timescales of simulations (Drake et al., 2005; Dabdub and Seinfeld, 1996). Higher resolution is being used to reduce the uncertainties and systematic errors due to parameterizations and other sources of error associated with coarser grained models.

Our ability to simulate climate change over extended periods is heavily constrained by uncertainty in the subgrid models that are used to describe behavior occurring at scales less than the dimensions of a single grid cell. Typical grid cell dimensions are in the range of 35-70 km for current global simulations of climate. At these resolutions, much of the behavior at the subgrid scale, particularly of clouds, must be heavily modeled and parameterized, and different models give significantly different results. The behavior of clouds in these models is a major source of uncertainty (Liou, 1986). Efforts are currently underway to develop a Global Cloud Resolving Model (GCRM) (Randall et al., 2003) designed to run at a grid resolution on the order of kilometers.


At 4 km, simulations become "cloud permitting" and individual clouds can begin to be modeled without including them as subgrid parameterizations. At higher resolutions of 2 km and 1 km, individual cloud behavior can be fully modeled. Simulations at these resolutions will substantially reduce the level of approximation at the subgrid scale and provide a more physically based representation of the behavior of clouds. In the short term, results from these simulations will be used as the basis for increasing the accuracy of climate simulations at coarser resolutions that can be run more efficiently to simulate longer periods of time. GCRMs are likely to be used for operational numerical weather prediction within about ten years and to perform "time slice" simulations within longer climate change simulations on coarser grids.

The GCRM will initially be run using a minimum of 80 K processors, writing terabytes of data to disk. However, to approach climate-length simulations at the target resolution of 4 km, a million processors will be required. IO will be a serious bottleneck on overall program performance unless careful consideration is given to designing a high performance IO strategy. Previously, the most widely used approach for handling such large IO requirements has been to have all processors engage in IO, either to separate files or to a few shared files using a parallel IO library. Having each processor write to separate files is undesirable, both because it will result in a huge number of files and because having that many processors doing large writes simultaneously will overwhelm the IO system. Similarly, having all processors write to a shared file is also likely to overwhelm the IO system. More recently, researchers have focused on creating IO collectives that aggregate data to a subset of processors before writing to disk. This allows programs to exploit the higher bandwidth available for communication to stage data to a smaller number of processors before transferring it to a file. The smaller number of IO processors can minimize contention while simultaneously maximizing IO bandwidth. Several recent reports have described collectives of this type, particularly in regards to the MPI-IO library (Lang et al., 2009). However, these optimizations are designed to handle all possible situations and may not come up with ideal solutions for every problem. Further improvements may be available by organizing data at the application level. This paper will describe the implementation of an IO API for a parallel GCRM that provides flexibility in controlling the fields that appear in the output and the frequency at which output is written, while simultaneously allowing the user to control the number of processors engaging in IO and the size of IO writes. This extra layer of control provides additional options for optimizing IO bandwidth.

There has been considerable recent work investigating parallel IO using all processors. Antypas et al. (2006) had each processor write local data to disk in a large run of the FLASH astrophysical simulation code using 65 K processors of an IBM BlueGene/L machine. However, this resulted in over 74 million files, severely complicating post-processing and analysis. An alternative to exporting local data is to use parallel IO libraries that allow multiple processors to write to different locations within the same file. Recent implementations of such libraries include Parallel NetCDF (Li et al., 2003) and the HDF5/NetCDF4 libraries (Yang and Koziol, 2006), both of which are in turn built on top of the MPI-IO libraries (Thakur et al., 1999). These parallel IO libraries allow programmers to write files in a platform-independent format that is widely accepted in the climate community. The API described below is built around such libraries.

However, parallel IO libraries by themselves do not represent a complete solution when very large numbers of processors are used. Antypas et al. (2006) found that IO did not scale for the FLASH code beyond 1024 processors on the IBM BG/L platform using HDF5, parallel NetCDF, and even basic MPI-IO.

Yu et al. (2006) investigated optimizations to MPI-IO, also on an IBM BG/L platform, that demonstrated scaling to about 1000 cores for the FLASH IO benchmark, as well as the HOMME atmospheric dynamical code. Ching et al. (2003) investigated the effect of different low level data read/write strategies on IO performance using both FLASH and a variety of other non-application benchmarks. Although they found reasonable scaling behavior, their studies did not extend beyond 128 processors. Saini et al. (2006) also reported results for the FLASH IO benchmark using HDF5 but saw effective bandwidth drop off significantly after about 128 processors. Using a different astrophysical benchmark, MADbench2, based on a cosmic microwave background data analysis package, Borrill et al. (2003) investigated read and write performance to both separate files for each processor and shared files on a broad range of platforms. They found that IO scaled for both separate and shared files in almost all cases, but only reported results up to 256 processors.

For very large numbers of processors, IO scaling behavior is unclear. Because total bandwidth to disk is a finite resource, contention between processors may actually lower bandwidth when large numbers of processors are all trying to write concurrently (Mache et al., 1999; Saini et al., 2006). Antypas et al. (2006) and Saini et al. (2006) did not see scaling when going to high numbers of processors using FLASH coupled with HDF5. It is not clear that the alternative of having each processor write its own local data will scale either to petascale systems containing thousands or tens of thousands of processors. Furthermore, the number of files generated if each processor exports data becomes difficult to manage (74 million in the case of a large FLASH simulation). While these files could be post-processed back into a global view of the data, this step will consume significant resources and introduces the possibility of errors in the post-processing step. It may also require double storing the data or discarding the raw data. A very recent study by Lang et al. (2009), however, has shown that optimizations to MPI-IO collective operations, including data aggregation, have led to scaling up to 100 K processors. This has been demonstrated for several synthetic benchmarks as well as the MADbench2 and FLASH3 codes.

Additional libraries are under development for use in high performance computing applications. PIO has been developed at NCAR to provide a common interface for several IO backends. The PIO interface itself is similar to the parallel NetCDF and NetCDF4 libraries, so the data remapping etc. described below is still required in order to use it. However, using PIO would allow users to switch seamlessly between several IO libraries (Dennis et al., in press). The ADIOS library being developed at Georgia Tech and ORNL (Lofstead et al., 2008) also provides a common interface to several different IO libraries and data formats, as well as implementing many optimizations designed to improve IO performance (Lofstead et al., 2009). However, to achieve these performance gains, ADIOS has created its own BP format, which requires that the data subsequently be converted into NetCDF or HDF5 formatted files for which analysis tool chains exist (Lofstead et al., 2010). These optimizations have led to dramatic improvements in applications that export data using many small writes but may not lead to such large performance gains when IO consists of large writes. The GCRM code writes out data in large blocks, so improvements over parallel NetCDF or HDF5/NetCDF4 may be harder to achieve, particularly when the cost of reformatting data is factored in.

Although optimizations to MPI-IO, such as aggregation of many small IO requests into larger single IO reads/writes and staging of data to a smaller number of IO processors, have led to significant performance gains, these optimizations do not always identify ideal solutions in all cases. Additional performance may be gained by further manipulation at the application level.


The API described in this paper reformats data from the native application layout to allow it to be written to disk in large contiguous chunks and provides users significant extra flexibility in configuring how the application aggregates and stages data for IO. The extra flexibility can lead to substantial performance gains in some instances over the optimizations in the parallel IO libraries themselves.

The remainder of this paper will describe the GCRM IO API library (GIO), including: the data layout of the GCRM code itself, the data layout for files written by the API, the user interface to the API, the communication strategies for moving data to the IO nodes, and performance results for the API using a number of different communication strategies.

2. The geodesic grid and GCRM data layout

The GCRM code that will incorporate the IO API is being developed at Colorado State University. Because the GCRM development is occurring in parallel with the development of the IO API, the results reported here will be for simulations that were performed using a GCRM predecessor. This is a hydrostatic simulation code (HYDRO) that uses the multi-level grid solver that will be incorporated into the GCRM. This code uses a geodesic grid, which is the grid that will be used by the GCRM, and has a similar internal data layout.

Unlike the latitude-longitude grid traditionally used in many climate studies (which has well-known singularities at the poles), the geodesic grid has the desirable property of tiling the surface of the globe with hexagonal cells of relatively uniform size. The geodesic grid also contains a surprising amount of regular structure that can be exploited in laying out data on a parallel computer. The grid starts from a regular icosahedron, which is a platonic solid with 20 equilateral triangles as faces (Fig. 1(a)). The vertices of this figure can be circumscribed inside a sphere. The geodesic grid is created by decomposing each triangular face into four new triangles by bisecting the edges of the triangle and joining the midpoints of each edge to each other (Fig. 1(b)). The new vertices of the triangles are then projected onto the surface of the sphere, and the process is repeated until the desired resolution is reached. The cells are constructed by using the vertices from the bisection procedure as the cell centers and applying a standard Voronoi construction to get the cell boundaries (Fig. 1(c)). For a more detailed description, see Randall et al. (2002).

If the original triangles are paired so that each pair shares an edge, then the grid points generated from each pair form a logically structured block (Fig. 1(d)). These blocks can be stored as a set of regular square arrays of data points. The original 20 triangles form 10 of these square panels. The recursive bisection algorithm implies that the dimension of each of the square panels is a power of two, which further implies that the total number of cells N in the geodesic grid is given by the formula

N = 10 × 2^(2R) + 2

Fig. 1. Decomposition and data layout for a geodesic grid.

The extra two cells correspond to the north and south poles. A 4 km resolution grid corresponds to a value of R = 11. A list of R values and their corresponding resolutions is given in Table 1.
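The cell counts in Table 1 follow directly from the formula above. As a quick check, the resolution column is also consistent with taking the square root of the mean cell area on a sphere of Earth's radius; the short Python sketch below reproduces the table under that assumption (the radius value and the equal-area interpretation of "resolution" are assumptions on our part, not statements from the paper).

import math

EARTH_RADIUS_KM = 6371.0                          # assumed nominal Earth radius

def cell_count(R):
    return 10 * 2 ** (2 * R) + 2                  # N = 10 x 2^(2R) + 2, as given above

def approx_resolution_km(R):
    area = 4.0 * math.pi * EARTH_RADIUS_KM ** 2   # surface area of the sphere
    return math.sqrt(area / cell_count(R))        # edge of an equal-area square cell

for R in range(7, 14):
    print(R, cell_count(R), round(approx_resolution_km(R), 2))
# R = 11 gives 41,943,042 cells and roughly 3.5 km, matching Table 1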

The GCRM divides each square panel into a number of square sub-blocks. The requirement that the sub-blocks are also square and all the same size implies that the number of sub-blocks per panel is a power of 4. The total number of sub-blocks M is given by the formula

M = 10 × 4^Q

where Q is some integer. The number of processors that the current GCRM runs on must be an even divisor of M, which places a restriction on the number of processors. This requirement guarantees that each processor has the same number of sub-blocks. Generally, the GCRM can be run on some power of 2 times 10 processors (e.g. 10, 20, 40, 80, 160, etc.), but a few other values will work as well (e.g. 8, 128, etc.). The north and south poles are also stored on two of the processes and are included in the ghost cell region of two of the sub-blocks.
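A minimal sketch of this processor-count restriction: any processor count that divides M = 10 × 4^Q evenly gives every process the same number of sub-blocks. The choice of Q below is only an example.

def valid_processor_counts(Q):
    # total number of sub-blocks for panel-subdivision parameter Q
    M = 10 * 4 ** Q
    # any processor count that divides M evenly is allowed
    return [p for p in range(1, M + 1) if M % p == 0]

print(valid_processor_counts(2))
# -> [1, 2, 4, 5, 8, 10, 16, 20, 32, 40, 80, 160] for M = 160

This reproduces the pattern quoted above: powers of 2 times 10 (10, 20, 40, 80, 160) plus a few other values such as 8 and 32; 128 becomes valid once Q >= 3.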

3. Mapping output to NetCDF

Until recently, climate and weather models have primarily been simulated on structured grids that divide the latitude and longitude axes in even increments, resulting in logically structured simulation grids. Standard conventions for describing this data in the NetCDF data model have been formalized by the Climate and Forecast (CF) conventions. CF defines conventions and metadata standards that enable both human and computer interpretation of the data. Human interpretation is supported through the use of standard names, while definitions of spatial and temporal properties of the data have enabled an extensive set of tools for data manipulation and display (NCO, OPeNDAP, Ferret).

From the previous description, it can be seen that the geodesic grid is fairly regular. However, the horizontal dimension has some important properties in common with unstructured grids: the grid coordinates are not monotonic and simple conventions are not available for identifying the neighbors of all cells. As a consequence, it is necessary to provide more information about the topology of the grid. Other unstructured grids such as triangular, cubed sphere (Adcroft et al., 2004), and arbitrary unstructured polygons are also being applied to various models. There is a recognized need to extend the CF conventions to unstructured grids so that general data analysis, regridding, and display tools can be developed. As yet, consensus on such a standard has not been reached. Balaji et al. (2007) have cataloged a number of unstructured grids and propose a tiling approach to describing grids.

Our data model is designed to support efficient model output, fully describe the grid topology, and provide sufficient information for tessellation to triangles for 3D visualization. A couple of important points should be noted about our model.



Table 1
Relationship between the grid parameter R and the grid resolution.

R    Number of Cells    Grid Resolution (km)
7        163,842           55.9
8        655,362           27.9
9      2,621,442           14.0
10    10,485,762            6.98
11    41,943,042            3.49
12   167,772,162            1.75
13   671,088,642            0.873


First, for model output performance purposes, the horizontal dimension appears as the first dimension in the array specification of each variable. Apart from time, this is the slowest varying dimension in NetCDF specifications of arrays. This minimizes striding at write time since all other indices (e.g. vertical dimension and vector components) are fully contained within each processor. There will be tradeoff performance costs for analyses that operate on a subset of the vertical dimension. For some 3D fields it may be true that most subsequent analyses will be done on 2D slabs, in which case the ordering described above will be inefficient. For these cases, a separate 2D field could be exported to the file. A second issue is the need to support 64 bit integers to describe extremely large arrays. These are needed internally to evaluate offsets and addresses and may even be required at the interface level to describe extremely large arrays. Because of this, and for clean backward compatibility with the NetCDF APIs, large data (requiring 64 bit addressing) can only be processed on 64 bit machines.
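To make the dimension ordering concrete, the sketch below defines a variable with the layout just described (time slowest, then the horizontal cell index, with the vertical index varying fastest). It uses the Python netCDF4 bindings purely for illustration; the GCRM itself writes through the parallel NetCDF library from Fortran, and the file name, dimension sizes, and variable name here are illustrative assumptions.

from netCDF4 import Dataset

ds = Dataset("gcrm_sample.nc", "w", format="NETCDF4")
ds.createDimension("time", None)            # unlimited; slowest dimension
ds.createDimension("cells", 41943042)       # horizontal cell index (R = 11 grid)
ds.createDimension("interfaces", 26)        # assumed number of vertical interfaces
pressure = ds.createVariable("pressure", "f4", ("time", "cells", "interfaces"))
# writing one cell's vertical column touches a contiguous run of "interfaces" values
ds.close()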

4. API design

This section will describe in detail the design and implementation of the IO API. The structure of the API code is illustrated schematically below in Fig. 2. The IO API layer consists of a small collection of subroutine calls that are used to connect the IO library to the GCRM application. These subroutines are supplemented by two files. One provides a description of the data that will be written by the IO routines and the second describes the files that will be written.

Fig. 2. Schematic diagram of the structure of the IO API.

A large part of incorporating the IO API into the GCRM code is associated with the API initialization, which connects the data in the GCRM to the API via a registration mechanism. After initialization, the only calls to the API are through a driver function. The behavior of this function is controlled by user supplied information in the output configuration files that details what files are to be written, what data should be contained in the files, and how frequently it should be written. The driver function is responsible for reformatting the data from its local array structure to the linearized data format described below. The linearized data is sent to the IO processors and then passed to a subroutine that is responsible for calling the appropriate subroutines in the parallel IO library that actually open shared files and write data to disk.

The entire API is modular, making it relatively easy to switch between different messaging algorithms and to use different IO libraries. Modules that write out data using both the parallel NetCDF and HDF5 libraries have been written. The flexibility of the API interface itself and the use of files for both describing data and specifying output file contents and frequencies allows GCRM developers to add or subtract data from the API as the GCRM model itself evolves. Given the uncertainty regarding data format standards and the direction of parallel IO libraries, this flexibility is extremely useful and should minimize the overhead of allowing the IO layer to evolve as both the application and the parallel IO libraries change. The flexibility in specifying the contents and frequency of the output files will also provide people running the GCRM with many options for managing their overall data volumes. This will be important given the extremely large data sizes that could potentially be generated from high resolution simulations.

4.1. Output data layout

An early consideration in designing the API was choosing a file layout for the data that would support high performance in the IO layer and would be independent of the number of processes used to run the GCRM. The fact that the geodesic grid consists of 10 square panels plus the two poles suggested that data be organized by panels. The large scale structure of the layout for each field is to write out the data for the two poles followed by the data for each of the panels.



Fig. 4. Illustration of Morton-ordering curve for a 16 × 16 array of points.


This is illustrated schematically in Fig. 3. Within each panel the situation is more complicated. Writing out the data as standard column-major or row-major arrays will result in strided writes that are composed of multiple separate segments, which drastically reduces the efficiency of individual writes. It is much more desirable to write data in large, contiguous chunks. However, simply writing out local blocks to contiguous segments within each panel region will lead to files where the data organization changes if the number of processors in the simulation changes, since the size of local blocks changes with the number of processors.

The data within each panel can be organized using a Morton-ordering scheme (also referred to as Z-ordering) (Morton, 1966) to index each cell within the panel. This has the advantage that it is much easier to guarantee that each write represents a contiguous chunk in the file. The Morton-ordering scheme works very well for square arrays that are an integer power of 2 in dimension and is illustrated schematically in Fig. 4. Each array element is indexed by successive locations in the self-similar space-filling curve. The figure shows a panel that has been divided up into 16 blocks. Note, however, that if the panel had been divided into only 4 blocks, the points within each block would still lie along a continuous segment of the space-filling curve. Self-similarity is an important property of the Morton index scheme since it allows users to change the number of processors in the simulation without changing the layout of the output files. The self-similar curve also has the useful property that an arbitrary segment taken out of the sequence of Morton indices will generally correspond to a set of cells that are close to each other in space. Conversely, a compact region of space will tend to contain cells with indices that are close to each other. This property can potentially boost the performance of analysis routines by ensuring that cells that are close to each other on the surface of the sphere are also close to each other in memory.

The Morton index for each cell in the panel can be constructed by taking the zero-based i and j indices of the cell location in the panel and interleaving the bits of the binary representations of i and j to get a new number. This is illustrated schematically in Fig. 5 for a hypothetical cell at location (14,9) on a 16 × 16 panel. The Morton index for this cell turns out to be 233.
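A minimal Python sketch of the bit interleaving just described, assuming zero-based i and j indices within a square panel whose edge is a power of two (the function name is ours, not part of the API):

def morton_index(i, j, bits):
    # interleave the bits of i and j, with the i bit in the more
    # significant position of each pair, as in Fig. 5
    m = 0
    for b in range(bits):
        m |= ((j >> b) & 1) << (2 * b)        # j bit goes to the even position
        m |= ((i >> b) & 1) << (2 * b + 1)    # i bit goes to the odd position
    return m

print(morton_index(14, 9, 4))   # -> 233 for the (14, 9) cell on a 16 x 16 panel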

The mapping described above effectively linearizes the cell indices over the surface of the globe into a single one-dimensional format. However, the cell indices are not sufficient to describe a particular data point. Most of the data generated by the GCRM will have a vertical dimension and many data have additional component indices associated with them. This includes corner and neighbor lists for each cell and other structures that describe the grid. Multiple snapshots at different times can be written to the file, introducing an additional time index. Apart from the cell index, data associated with different values of the remaining indices (apart from time) are collocated on the same processor. Thus, all values in a vertical column above a given cell are on the same processor. Data is only distributed over the cell index. To guarantee that all data written to the file is contiguous for each write, the cell index should be the slowest index in the file (again, apart from time).

Fig. 3. Schematic of data layout for a single variable record in a file.

This means that any level, component, or neighbor indices should be grouped as the fastest indices, then the cell index, with time last (assuming that the field has a time dependency).
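As an illustration of the resulting ordering, the sketch below computes where a single cell-centered value lands within one time record, assuming zero-based indices, that the two pole cells carry the same number of vertical values as any other cell, and counting values rather than bytes; none of these helper names come from the API itself.

def cell_value_offset(panel, morton, level, cells_per_panel, nlevels):
    # record layout: [north pole, south pole,
    #                 panel 0 cells in Morton order, ..., panel 9 cells],
    # with the level index varying fastest within each cell
    cell_position = 2 + panel * cells_per_panel + morton
    return cell_position * nlevels + level

# example: first vertical level of the cell with Morton index 233 on panel 3
# of an R = 4 grid (cells_per_panel = 2**(2*4) = 256), 25 vertical levels
print(cell_value_offset(3, 233, 0, 256, 25))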

In addition to data associated with the centers of cells, there is also data associated with cell corners and cell edges. A property of the geodesic grid is that all grid cells, with the exclusion of the north and south poles, have exactly two unique corners and three unique edges associated with them. The north and south poles have no unique edges or corners. Because of this property, the mapping for corner and edge data is very similar to that for cell-centered data. The main difference is that the values for the north and south poles are not inserted at the start of the record and two unique corner or three unique edge values are written for each cell in the 10 panels. The corner or edge values for each cell are written consecutively before moving on to the next cell. Like the cell-centered data, corner or edge data has the corner or edge index in the slowest position, apart from time.

4.2. Registering data

Data registration has two parts: describing the data to the API and notifying the API about the location of the data in memory. The data descriptions are stored in a data configuration file that is read when the API is initialized.

Fig. 5. Interleaving bits to form a Morton index. An i index of 14 in a 16 × 16 array corresponds to the binary representation 1110 and a j index of 9 corresponds to 1001. The interleaved binary representation is 11101001, corresponding to a value of 233.


Table 2
Subroutines supported by the IO API.

gio_init - Initialize the IO API and supply basic information on the size of the grid, the size of the blocks on each processor, the number of IO processors, etc.
gio_terminate - Close the IO API and clean up resources.
gio_driver - This routine is called at every timestep and supplies the current time to the API. Based on the information in the file configuration file, the API decides which files need to be written to.
gio_grid_setup - Supplies information to the API on the location of data fields describing the grid.
gio_grid_setup_pole - Supplies information to the API on the location of data fields describing the poles.
gio_register_dfield, gio_register_dpole, gio_register_ifield, gio_register_ipole - The registration subroutines supply information to the API on the location in memory of data fields that are described in the data configuration files. The dfield and dpole subroutines are for double precision fields; the ifield and ipole subroutines are for integer fields.
gio_register_dlevel - This subroutine is used to specify the location of data fields that describe the vertical layers. These fields are assumed to be replicated across processors.


Each data descriptor fully defines a single variable and consists of a name for the data item, a data type (integer or real), the grid cell location (cell center, corner, edge), and other grid dimensions. Unrecognized keywords within a descriptor are interpreted as metadata attributes to be associated with the variable. Two examples of the data descriptors are shown in Fig. 6a.

The first, grid_center_lat, is the latitude location of the grid cell centers. The data is given a name (used to refer to the field when specifying which files are written out and which data they should contain), the units for the data, the data type (float or integer), and a descriptive name that appears in the output files. The latitude values of the cell centers appear in the output as a one-dimensional array and the dimension of this array is given by the dimension attribute and has the value "cells". The numerical values of the dimensions are assigned by the API when the application code calls the API initialization routines. The standard_name and bounds attributes demonstrate the capability of the API to tag variables with additional, arbitrary metadata. In this case, the metadata follows CF conventions for making the data self-describing.

The grid_center_lat array is a property of the grid and is invariant over time. The pressure variable, on the other hand, changes as the simulation proceeds and so it has 'time' as one of its dimensions. The pressure is also defined on the interface between vertical layers for each cell, so it has the interfaces dimension as well. Cell variables can be located vertically either in the center of a vertical layer or at the interface between two consecutive layers. For variables defined at the interface, the 'interfaces' dimension is used. The "coordinates" attribute is another example of arbitrary metadata that, in this case, enables mapping of the pressure variable to its grid variables. An alternative to using the descriptor file would be to add these attributes to the data via additional functions in the API interface, but we have found that the descriptor file is an easy and flexible way of managing data attributes.
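For readers without access to Fig. 6a, the two descriptors discussed above can be summarized as follows. This Python rendering is purely illustrative: the attribute names are the ones mentioned in the text, but the on-disk syntax of the data configuration file and the specific attribute values shown here are assumptions.

grid_center_lat = {
    "name": "grid_center_lat",
    "type": "float",
    "units": "degrees_north",                        # assumed CF-style units string
    "long_name": "latitude of grid cell centers",    # descriptive name (assumed wording)
    "dimensions": ("cells",),
    "standard_name": "latitude",                     # CF standard name (assumed value)
    "bounds": "grid_corner_lat",                     # assumed name of the corner-latitude variable
}

pressure = {
    "name": "pressure",
    "type": "float",
    "units": "Pa",                                   # assumed units
    "dimensions": ("time", "cells", "interfaces"),
    "coordinates": "grid_center_lon grid_center_lat",  # assumed coordinates attribute
}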

After reading the data configuration file, the API has an internal list of data descriptors. The descriptors must then be connected to actual data inside the application code. This can be done by calling registration functions from inside the application. The strategy used here was to collect all the registration calls into a single subroutine that is called after the API initialization. The registration functions are designed to connect the data descriptors, created when the API read the data configuration file, to the memory locations for internal blocks of data used by the application.

The use of the data configuration file substantially lowers the number of functions that need to be included in the API. A complete listing of the functions supported by the API is given in Table 2 and includes less than a dozen subroutines.


Fig. 6a. Example variable descriptor configuration file.

The need to closely couple the data configuration file with the data registration functions in the application code means that writing the data configuration file should be considered as part of the code development process and that the file is not meant to be modified by application users. In this regard, the data registration file is similar to other files read in by the application at startup, such as adjusted grid coordinate files, that are not meant to be modified by users.

Although the fields available for export are dictated by the application via the data descriptor files and data registration calls, users are given considerable flexibility in specifying what data can be written to files via a second set of configuration files, the output configuration files. An output configuration file allows users to specify what data will be written, what file to write to, and how frequently a file should be written. An example file entry is shown in Fig. 6b.

All file descriptors specified in the configuration file start with the name given by the base_name attribute and are located in the directory given by the base_directory attribute. The date of the first time record in the file is appended to the base name to give the actual file name appearing in the base directory. The frequency keyword indicates how often (in seconds) the data should be written out and the nsamples keyword indicates how many time records should be included in a single file. After 6 time records have been written, the file in this example is closed and a new file created. This feature is designed to control the size of output files. The remaining entries in this example list the fields that are contained in the file. The model user can choose to put multiple variables in a single file, one variable per file to create single variable time-series data, and to put the grid in a separate file or in each file. This flexibility enables model users to configure IO in many ways, including writing different data at different frequencies.
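An entry of this kind could be summarized as the following Python dictionary. The keyword names (base_name, base_directory, frequency, nsamples) are those described in the text; the values and the field list are illustrative assumptions, and the real output configuration file uses its own plain-text syntax (Fig. 6b).

pressure_output = {
    "base_name": "gcrm_pressure",             # assumed example value
    "base_directory": "/scratch/gcrm/run01",  # assumed example value
    "frequency": 3600,                        # write interval in seconds (assumed value)
    "nsamples": 6,                            # time records per file before a new file is opened
    "fields": ["pressure", "grid_center_lat", "grid_center_lon"],  # assumed field list
}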

4.3. IO strategy

The IO API supports a blocking IO library that is designed to export data to shared files at selected points in the calculation using collective operations from the parallel NetCDF library. Because the API is blocking, the effective bandwidth for writing files must be very high in order to keep IO from becoming a significant overhead on the entire GCRM calculation. The current goal is to keep the time spent in IO at less than 5-10% of the overall execution time.



Fig. 6b. Example entry in the output configuration file.


The API was chosen to be blocking because the MPI message-passing communication model that is used for both the GCRM and the API, as well as inside the parallel NetCDF and MPI-IO libraries, does not have strong support for the one-sided communication that would be necessary to create a non-blocking API. Alternatively, a non-blocking API could also be written using multithreading to spin off additional threads within a process that would handle IO independently of the main computation. Either approach could be used to decouple the main computation from IO so that both could run concurrently. This would substantially lower the bandwidth requirements for IO but would raise memory requirements, since the data targeted for IO would need to be copied to a separate buffer until it could be written to disk. It is also not clear what effect having separate IO threads running on the processor would have on the main computation. It is still possible that the IO thread could slow down the main computation enough to nullify the advantages of overlapping IO and computation.

To maximize IO bandwidth, the IO API is designed to allow users to designate a subset of processors for IO instead of trying to have all processors writing to a shared file simultaneously. The original goal was to minimize contention for resources by not having all processors on a large job trying to access disk at the same time. Recently, optimizations of this type have been built directly into parallel IO libraries, but these represent generic algorithms that may not work optimally in specific cases. As described below, the additional flexibility available in allowing the application to configure which processors perform IO and how the data is staged to these processors can result in substantial performance gains, particularly at large processor counts.

To implement this strategy, it is necessary to collect data on the IO processes using message-passing implemented with the MPI libraries. The API does not make any assumptions about which data blocks are located on which processors. The only restriction is that all processors contain the same number of data blocks. The generality of the data layout complicates the task of setting up a message schedule. As a result, two messaging strategies were developed. The first is relatively straightforward to implement and puts most of the burden for organizing the data on the parallel IO library. The second messaging strategy is more complex, but permits additional optimizations, such as aggregating smaller writes into a single large write. In general, it is preferable to perform fewer writes on larger blocks of data, although individual writes that exceed a certain threshold may be decomposed into small blocks on some systems.

The starting point for either messaging strategy is to assign each of the M blocks in the geodesic grid an index from 0 to M - 1.

The index is based on block location and follows the Morton-ordering scheme described above. At initialization, all processes exchange information describing which processes own which data blocks. Based on this information, the non-IO processes can determine where they need to send their data and the IO processors can determine from where they will be receiving data. A complicating factor is that IO processors may also be sending data to other processors, because in the general data layout model there is no guarantee that an IO processor is responsible for writing the blocks that currently reside on it.

The simplest messaging scheme, referred to hereafter as direct messaging, is to assume that there are a total of P processors with K processors doing IO, so that L = P/K processors are sending data to each IO process. If I is an IO process, then processes I, ..., I + L - 1 are all sending data to process I. This scheme has the advantage that no IO process is sending messages to another IO process. This is a key simplification since the writes are, or at least can be, collective on the IO processors. The most difficult aspect of creating a message schedule is avoiding message deadlocks because a send-receive pair is split by a collective write.
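A minimal sketch of the direct-messaging assignment, assuming the P ranks are grouped into K contiguous blocks of L = P/K ranks with the first rank of each block acting as the IO process (consistent with the statement that processes I, ..., I + L - 1 all send to IO process I):

def io_rank_for(rank, P, K):
    L = P // K                 # number of senders per IO process
    return (rank // L) * L     # IO process that receives this rank's blocks

# example: 2560 ranks with 1280 doing IO -> each IO rank receives from itself
# and from one neighbouring rank
senders_for_rank0 = [r for r in range(2560) if io_rank_for(r, 2560, 1280) == 0]
print(senders_for_rank0)       # -> [0, 1]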

After each process has (1) determined which blocks it is receiving and from which processors it is receiving them, as well as (2) which blocks it is sending and their destination processes, each process orders the sends and receives so that they are ranked from lowest block index to highest block index. This guarantees that for every receive, the corresponding send has been posted somewhere. The block indices are used as tags for the send-receive pairs, thereby guaranteeing that each send-receive pair has a unique tag.

The IO processors will have both sends and receives, but for the direct messaging scheme the send-receive pairs are on the same process (the IO process). For this case, the sends are not posted and a receive is translated into a simple copy operation from the application data buffer to the receive buffer. The copy operation into the IO buffer, which occurs for all data blocks, is where the data is reordered from the internal 2-dimensional format used by the application to the linear Morton-ordered format used in the files. It is also the point at which double precision data in memory is reduced down to single precision data for file storage. Storing data in single precision can cut file sizes by almost half, as well as reducing the time spent writing them to disk. For regular sends to other processors, the copy operation is performed when moving data to the send buffer. After each receive, the contents of the receive buffer are passed to the parallel IO library and written to the file. The entire messaging scheme is illustrated schematically in Fig. 7(b).

Because the direct messaging scheme cannot guarantee that blocks are being received with consecutive block indices on the IO processor, the maximum size of each write is dictated by the size of the blocks. For large processor counts, this results in small writes even for very high resolution grids. Larger write volumes should result in higher performance (Thakur et al., 1999), so it is desirable to aggregate the data from multiple blocks into a single large write. To do this, blocks must be received on the IO processors in order of consecutive block index. The second messaging scheme, referred to as interleaved messaging, is designed to do this, but it results in a much more complicated ordering of the sends, receives, and writes.

As with the direct messaging scheme, all processors in the interleaved scheme exchange data to determine which processors are sending data and to whom, and which processors are receiving data and from whom. The IO processors will each be receiving J = M/K consecutive blocks of data. For example, if J = 8 then the first IO process receives blocks 0-7, the second IO process receives blocks 8-15, etc. Sends and receives must again be ordered.


Fig. 7. Schematic of direct and interleaved messaging schemes for two GCRM panels. Each panel has 4 IO processors associated with it. For (a) the indices for each data block are shown; for (b) and (c) the processor IDs are shown. The captions at the bottom list which data blocks (based on block indices) are written for each IO processor. Figure (b) shows the messaging for the direct messaging scheme, figure (c) shows the interleaved messaging scheme.


However, since IO processors can now exchange messages, it is necessary to interleave the sends and receives to avoid deadlocks that can occur if a receive is posted that is unmatched with a corresponding send. This can be accomplished by ordering the sends and receives on each process by the value of the block index modulo J (hereafter, these values are designated by j). If two sends or receives on the same process have the same value of j, then they are ordered by the actual value of the block index. On the non-IO processors, all sends are posted using this ordering. On the IO processors, the sends and receives are posted in the order of their values of j. Writes are only executed on the IO process after all transactions corresponding to a given value of j have been executed and before any of the transactions corresponding to the next value of j have been posted. This guarantees that no write, which can be a collective on the IO nodes, splits a send-receive pair. Note that each IO processor will step through the J consecutive values of the block indices that it has been assigned using this scheme. This provides an opportunity to aggregate some number of these blocks into larger writes for higher IO performance. The interleaved messaging scheme is illustrated schematically in Fig. 7(c).
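The ordering rule can be stated compactly: on every process, transactions are sorted first by block index modulo J and then by the block index itself, and each aggregated write on an IO process is issued between one value of j and the next. A small sketch, with illustrative names only:

def interleaved_order(block_indices, J):
    # sort by block index modulo J, breaking ties with the block index itself
    return sorted(block_indices, key=lambda b: (b % J, b))

# example with J = 8: an IO process owning blocks 0..7 handles them in order,
# while a non-IO process holding blocks 5 and 10 posts the send for block 10
# first, because 10 % 8 == 2 comes before 5 % 8 == 5
print(interleaved_order(list(range(8)), 8))   # -> [0, 1, 2, 3, 4, 5, 6, 7]
print(interleaved_order([5, 10], 8))          # -> [10, 5]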

A final comment on the code as a whole is to note that both the messaging module and the IO module responsible for actually writing the contents of the IO buffers to the file have been highly encapsulated. This has made it possible to create several versions of these modules that can be interchanged with each other relatively easily. Not only does this support the two messaging models described above, but it has also made it possible to explore different IO libraries as well. Although this paper will only discuss results for parallel NetCDF, an IO layer based on NetCDF4 (which sits on top of HDF5) has also been written for the IO API.

5. Results

This section will provide a brief summary of current performance results for the IO API. A more detailed analysis of IO for this application is planned for a separate paper. A series of test cases were performed using the HYDRO model configured to run the Jablonowski test case originally described by Jablonowski and Williamson (2006). Most of the simulations were run using 2560 processor cores on an R11 grid (4 km resolution). The number of IO processors was varied between 160 and 2560, and both the direct and interleaved messaging schemes were investigated. All writes using the parallel NetCDF library were done using collective IO. Some tests were done using independent IO but these consistently showed worse performance than tests using collective writes.

Note that for tests using direct messaging where the number of IO processors is equal to the total number of processors, there is no application staging of data and the application is relying entirely on the parallel NetCDF and MPI-IO libraries to aggregate and stage data optimally for IO. For interleaved messaging, 8 data blocks were aggregated into single large writes. It is anticipated that production runs of the GCRM will use a minimum of 10,240 processors, so a few additional runs were performed to check the API performance in this range. The runs were all performed on NERSC's Franklin supercomputer. Franklin is a Cray XT4 machine with 38,228 cores. It is attached to a Lustre file system that is split into two separate partitions, each of which has a theoretical bandwidth of 16 GB/s and a measured bandwidth of about 12 GB/s, based on the IOR benchmark developed at LLNL. The Cray XT4 is the initial architecture being targeted for production runs of the GCRM.

For the 4 km resolution grid, cell-centered fields with vertical levels represent approximately 4.2 GB of single precision data. Corner and edge fields are 2 and 3 times as much data, respectively. Fields were written individually to separate files and 6 snapshots per file were stored before creating a new file. This resulted in approximately 25 GB per field for a cell-centered field and correspondingly larger sizes for corner and edge variables. The size of individual data blocks is about 1.6 MB on 2560 processors and 0.4 MB on 10,240 processors. For the direct messaging scheme, individual blocks are written in separate write calls, so the write sizes correspond to the size of data blocks; for the interleaved messaging scheme, writes are 8 times larger. The files also contained grid data describing the location of cell centers, corners, neighbor lists, etc., but these are a small fraction of the total data. For these tests, a total of about 590 GB is written to disk.
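A back-of-envelope check of these volumes in Python. The vertical layer count for these runs is not stated in the text; 25 layers is an assumption chosen here because it reproduces the quoted 4.2 GB per field, 25 GB per six-snapshot file, and 1.6 MB per block on 2560 processors.

N_CELLS = 41_943_042          # R = 11 grid (Table 1)
LAYERS = 25                   # assumption, chosen to match the quoted 4.2 GB
BYTES_PER_VALUE = 4           # single precision output

field_gb = N_CELLS * LAYERS * BYTES_PER_VALUE / 1e9   # ~4.2 GB per cell-centered field
file_gb = 6 * field_gb                                # 6 snapshots per file, ~25 GB
block_mb = field_gb * 1e9 / 2560 / 1e6                # ~1.6 MB per block on 2560 cores
print(round(field_gb, 1), round(file_gb, 1), round(block_mb, 1))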

Because of optimizations in many IO libraries, including buffering of writes, it is often difficult to get meaningful numbers for IO performance. The approach taken in this paper is to put timers on all calls to the IO API and use this to determine the total time spent in the API on all processors. Other accumulators are used to keep track of the total number of bytes written to disk. The combination of total time spent in the API and total number of bytes written to disk can be used to determine an effective bandwidth for the IO API. This bandwidth includes the overhead of interprocessor communication, opening and closing files, and reorganizing the data, so the effective bandwidth is lower than what would be seen for individual writes. However, this is the bandwidth that is seen by the main application, so it is the most appropriate number to use for the application developer. Individual parts of the IO have been monitored separately and provide approximate estimates for the amount of time spent in communication, copying and reformatting data, and opening and closing files. These measurements indicate that by far the bulk of the time spent by the API is in writing data from buffers to disk.
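A minimal sketch of the effective-bandwidth metric as described above, assuming that the elapsed time spent inside the API (rather than, say, the time of the write calls alone) is used as the denominator; the numbers in the example are hypothetical.

def effective_bandwidth_gbps(total_bytes_written, seconds_in_api):
    # includes communication, data reordering, and file open/close overheads,
    # so the result is lower than the raw bandwidth of the individual writes
    return total_bytes_written / seconds_in_api / 1.0e9

# example: if the ~590 GB test set took 100 s inside the API, the effective
# bandwidth would be about 5.9 GB/s
print(effective_bandwidth_gbps(590e9, 100.0))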


A summary of the results on 2560 processors using the direct messaging scheme is shown in Fig. 8. Results for the interleaved messaging scheme are shown in Fig. 9. Two results using direct messaging on 10,240 processors are also included in Fig. 8. The solid symbols in both Figs. 8 and 9 show the results of running on a fixed number of processors (2560) and varying the number of IO processors. A few additional calculations using a total of 10,240 processors were done using the direct messaging scheme. The results indicate that the direct messaging scheme is more effective than interleaved messaging and that bandwidth increases with increasing processor count, up to about 1280 IO processors. At 2560 IO processors, bandwidth appears to tail off somewhat, suggesting that contention may be starting to degrade overall IO performance. A few runs using direct messaging were repeated multiple times to develop an estimate of bandwidth variability. These repeat runs indicate that bandwidth can fluctuate by around 1 GB/s. Experience on Franklin suggests that IO performance for any one application can be heavily influenced by other applications that are running concurrently and, based on the variation in repeat runs, this appears to be happening on these benchmarks. The differences seen between the interleaved and direct messaging schemes fall within the range of these variations, but given the consistency of the differences it is probably safe to conclude that direct messaging is more effective than interleaved messaging, at least at this time.

Two additional runs at 10,240 processors using the direct messaging scheme are also shown in Fig. 8. One run used 1280 IO processors, matching the optimal value found in the 2560 benchmarks, and the other run used all 10,240 processors for IO. The second run leaves all optimization for IO up to the IO libraries. The results using 1280 IO processors are comparable to those obtained for 2560 total processors. The slight drop in bandwidth could be due either to fluctuations in the IO rate similar to those seen for 2560 processors or to increased overhead from communication.

Fig. 8. Plot of bandwidth numbers for the IO API as a function of the number of IOprocessors for the direct messaging scheme. The solid symbols are run on 2560 totalprocessors, the two open symbols were run on 10,240 total processors. The bottomaxis shows the number of IO processors used in the calculation.

of 2, in going to 10,240 IO processors, indicating that for very largenumber of processors, there is still substantial benefit to reducingthe number of processors doing IO in the application layer and notleaving it entirely to the IO libraries. Although the bandwidthresults have shown a high degree of variability, a factor of 2, rep-resenting a difference of over 2 GB/s, is significant based on thefluctuations that we have seen.

Although these results suggest that there is a substantial benefit to staging data for IO in the application layer, it is not clear whether this will remain true in the long term. Work is actively underway to improve data aggregation and staging in the MPI-IO libraries as well as in higher level IO libraries such as parallel NetCDF and HDF5. Over the course of API development, the bandwidth numbers that have been obtained for IO have varied significantly as hardware has been added to the system and the parallel NetCDF and underlying MPI-IO libraries have been improved. In addition, the relative performance of the direct and interleaved messaging schemes has varied as well. The number of IO processors that yields optimal bandwidth has tended to grow over time. This partly reflects increases in the hardware resources that have been added to the system, but it is also likely due to improvements in the underlying IO libraries. As these libraries become more sophisticated at probing the system for available hardware, staging data to an optimal number of IO processors, and aggregating messages, the need to do this in the API layer decreases. However, the fact that IO performance appears to decrease on going from 1280 to 2560 IO processors, together with the large drop in performance in going to 10,240 IO processors, suggests that there are still further improvements to be made in the parallel IO libraries. As parallel IO libraries improve, more of these user level optimizations can likely be discarded, but at present additional staging in the API layer has allowed us to boost performance over what is available from the lower level IO libraries alone. Overall, the changing performance of the IO API and other parallel IO libraries means that it will be necessary to periodically reassess the optimal configuration for obtaining the largest IO bandwidth numbers.

Fig. 9. Plot of bandwidth numbers for the IO API as a function of the number of IO processors for the interleaved messaging scheme. A total of 2560 processors are used for all calculations. 8 blocks of data are aggregated together before each write. The bottom axis shows the number of IO processors used in the calculation.


Because of the large size of the data, verifying that the data exported to the files is correct has presented a challenge. To date, the primary validation method has been visualization of the data fields exported to files, to verify that the fields written to file are smooth and show no gaps or sharp edges that would indicate a problem with the data. Longer runs of the HYDRO code on a smaller R9 grid indicate that the data written to disk by the API matches the results reported by Jablonowski and Williamson (2006). The number of data fields was selected to mimic the expected output of the GCRM (approximately 11 3D cell-centered fields, 4 corner fields, and 5 edge fields, plus numerous 2D fields describing the grid). For these longer runs, IO consumes 1.75% of the execution time.

6. Conclusions

This paper has described the development and implementation of an IO API for a GCRM application code that is expected to produce output on the order of petabytes. Several issues needed to be addressed in order to develop such an API. These include determining what grid data and other metadata is required in the output files so that subsequent analyses and visualization can be performed on the data, developing a format for the data, developing an interface for the API, and creating algorithms to optimize IO performance.

The actual user interface consists of three components. The first is a relatively small library of functions used to initialize and terminate the API and to register application data in the API, together with a driver function that triggers the writing of files. The second component is a data configuration file that is used in conjunction with the registration functions to describe data in the application, and the third is an output configuration file that allows users to specify the number of output files, the contents of the files, and the frequency at which files are written.
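A minimal sketch of how these three components might fit together from the model's point of view is shown below. All names (gio_init, gio_register_field, gio_write, gio_terminate, and the configuration file names) are hypothetical placeholders chosen for illustration, not the actual GCRM IO API interface.

/* Hypothetical prototypes, declared here only so the sketch is self-contained. */
void gio_init(const char *data_config, const char *output_config);
void gio_register_field(const char *name, const float *data,
                        int nblocks, int cells_per_block, int nlevels);
void gio_write(double model_time);
void gio_terminate(void);

void model_io_example(float *temperature, int nblocks,
                      int cells_per_block, int nlevels, double model_time)
{
  /* The data configuration file describes the registered fields; the output
     configuration file controls the number of files, contents, and frequency. */
  gio_init("data.config", "output.config");

  /* Tell the API where the field lives in the model's memory. */
  gio_register_field("temperature", temperature,
                     nblocks, cells_per_block, nlevels);

  /* Driver call, made each step; the API decides from the output
     configuration whether this snapshot is written and to which file. */
  gio_write(model_time);

  gio_terminate();
}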

Apart from the API, there is also a need to develop a standard list of grid attributes and data structures that enable subsequent analysis and visualization of the data. These standards, as well as naming conventions for the data, are well established for the conventional latitude-longitude grids used in many atmospheric and climate simulations but are only just starting to be developed for non-regular grids such as the geodesic grid. The work done on this API has been occurring in collaboration with other bodies that are seeking to develop such standards. Along with developing a list of fields and attributes for the grid, a linearization scheme for the grid data was also developed. This format is comparable to conventional unstructured grid formats but also supports aggregation of the data and eliminates striding in the writes to file, both of which should support higher bandwidth to disk.
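As a schematic illustration of why such a layout avoids strided access (this is not the actual GCRM implementation, which writes through higher level parallel IO libraries, and the fixed-size-block assumption and names are ours), each data block occupies a single contiguous region of the file and can therefore be written with one unstrided call:

#include <mpi.h>

/* Write one linearized data block to its contiguous region of the file.
   block_index counts whole blocks from the start of the field; field_offset
   is the byte offset of the field within the file. */
void write_block(MPI_File fh, const float *block,
                 long long block_index, long long block_len,
                 long long field_offset)
{
  MPI_Offset offset = (MPI_Offset)(field_offset
                       + block_index * block_len * (long long)sizeof(float));
  /* One contiguous write per block; no strided access pattern. */
  MPI_File_write_at(fh, offset, (void *)block, (int)block_len,
                    MPI_FLOAT, MPI_STATUS_IGNORE);
}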

The implementation of the API is highly modular, and this has made it possible to investigate different messaging schemes and different parallel IO libraries without having to restructure large sections of the code. To date, two messaging schemes for sending data to the IO processors have been investigated and two different parallel IO libraries have been implemented in the API. Results on the NERSC Franklin machine to date indicate that different messaging schemes and data aggregation have relatively little impact on IO bandwidth. Adjusting the number of processors doing IO has a large impact on IO bandwidth. However, given the rapidly changing nature of IO in high performance computing, the need for application level staging may disappear at some point in the future. The current implementation is able to achieve an effective bandwidth of over 4 GB/s, representing a substantial fraction of the maximum available bandwidth on the Franklin machine. The tests so far have been relatively limited, but work is underway to evaluate the API at larger scales and on multiple platforms. This includes the Jaguar machine at Oak Ridge and the IBM BlueGene/P machine at Argonne.

Acknowledgments

The authors are indebted to Katie Antypas, Prabhat, and Mark Howison at the National Energy Research Scientific Computing Center (NERSC), Dave Knaak at Cray, Rob Latham and Rob Ross at Argonne National Laboratory, and Professor Wei-keng Liao at Northwestern University for invaluable help in getting the IO API running and optimized on the Franklin platform.

This work was funded by the U.S. Department of Energy's (DOE) Office of Advanced Scientific Computing Research through its Scientific Discovery through Advanced Computing program. A portion of this work was performed using the Molecular Science Computing Facility in the William R. Wiley Environmental Molecular Sciences Laboratory, a national scientific user facility sponsored by DOE's Office of Biological and Environmental Research and located at the Pacific Northwest National Laboratory, operated for DOE by Battelle Memorial Institute under Contract DE-AC05-76RL01830. The remainder of this research was performed at DOE's National Energy Research Scientific Computing Center.

References

Adcroft, A., Campin, J.M., Hill, C., Marshall, J., December 2004. Implementation of an atmosphere-ocean general circulation model on the expanded spherical cube. Monthly Weather Review 132 (12), 2845-2863.

Antypas, K., Calder, A.C., Dubey, A., Fisher, R., Ganapathy, M.K., Gallagher, J.B., Reid, L.B., Riley, K., Sheeler, D., Taylor, N., 2006. Scientific applications on the massively parallel BG/L machine. In: PDPTA 2006, pp. 292-298. June 26-29, Las Vegas, Nevada.

Balaji, V., Adcroft, A., Lian, Z., 2007. Gridspec: A Standard for the Description of Grids Used in Earth Systems Models. Workshop on Community Standards for Unstructured Grids, Oct 16-17, 2006.

Borrill, J., Oliker, L., Shalf, J., Shan, H., 2003. Investigation of leading HPC I/O performance using a scientific-application derived benchmark. In: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC'07), November 10-16, 2007, Reno, Nevada. IEEE Computer Society, Washington, DC.

Ching, A., Choudhary, A., Liao, W., Ross, R., Gropp, W., 2003. Efficient structured data access in parallel file systems. In: Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER '03), December 1-4, 2003, Hong Kong, China. IEEE Computer Society, Washington, DC.

Dabdub, D., Seinfeld, J.H., 1996. Parallel computation in atmospheric chemical modeling. Parallel Computing 22 (1), 111-130.

Drake, J., Jones, P., Carr, G., 2005. Overview of the software design of the community climate system model. International Journal of High Performance Computing Applications 19 (3), 177-186.

Dennis, J., Edwards, J., Loy, R., Jacob, R., Mirin, A., Craig, A., Vertenstein, M. An application level parallel I/O library for earth system models. International Journal of High Performance Computing Applications, in press.

Hammond, G., Lichtner, P., 2010. Field-scale model for the natural attenuation of uranium at the Hanford 300 area using high performance computing. Water Resources Research 46, W09527.

Jablonowski, C., Williamson, D.L., 2006. A baroclinic instability test case for atmospheric model dynamical cores. Quarterly Journal of the Royal Meteorological Society 132, 2943-2975.

Lang, S., Carns, P., Latham, R., Ross, R., Harms, K., Allcock, W., 2009. I/O performance challenges at leadership scale. In: Proceedings of SC 09, November 14-20, 2009, Portland, OR.

Li, J., Liao, W., Choudhary, A., Ross, R., Thakur, R., Gropp, W., Latham, R., Siegel, A., Gallagher, B., Zingale, M., 2003. Parallel netCDF: a high-performance scientific I/O interface. In: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing (SC'03), November 15-21, 2003, Phoenix, Arizona. IEEE Computer Society, Washington, DC.

Liou, K.N., 1986. Influence of cirrus clouds on weather and climate processes: a global perspective. Monthly Weather Review 114 (6), 1167-1199.

Lofstead, J., Klasky, S., Schwan, K., Podhorszki, N., Jin, C., 2008. Flexible IO and integration for scientific codes through the adaptable IO System (ADIOS). In: Proceedings CLADE 2008 at HPDC, June 2008. ACM, Boston, MA.

Lofstead, J., Zheng, F., Klasky, S., Schwan, K., 2009. Adaptable, metadata rich IO methods for portable high performance IO. In: Proceedings of IPDPS'09, May 2009, Rome, Italy.

Lofstead, J., Zheng, F., Liu, Q., Klasky, S., Oldfield, R., Kordenbrock, T., Schwan, K., Wolf, M., 2010. Managing variability in the IO performance of petascale storage systems. In: Proceedings of SC 10, November 2010, New Orleans, LA.

Mache, J., Lo, V., Livingstone, M., Garg, S., 1999. The impact of spatial layout of jobs on parallel I/O performance. In: Proceedings of the 6th Workshop on I/O in Parallel and Distributed Systems (IOPADS '99), Atlanta, GA. ACM, New York, NY, pp. 45-56.

Maltrud, M., McClean, J., 2005. An eddy resolving global 1/10° ocean simulation. Ocean Modelling 8, 31-54.


Morton, G.M., 1966. A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing. IBM Ltd., Ottawa, Ontario.

Neal, J., Fewtrell, T., Bates, P., Wright, N., 2010. A comparison of three parallelization methods for 2D flood inundation models. Environmental Modelling & Software 25, 398-411.

Randall, D.A., Ringler, T.D., Heikes, R.P., Jones, P., Baumgardner, J., 2002. Climate modeling with spherical geodesic grids. Computing in Science and Engineering 4 (5), 32-41.

Randall, D., Khairoutdinov, M., Arakawa, A., Grabowski, W., 2003. Breaking the cloud parameterization deadlock. Bulletin of the American Meteorological Society 84, 1547-1564.

Saini, S., Talcott, D., Thakur, R., Adamadis, P., Rabenseifner, R., Ciotti, R., 2006. Parallel I/O performance characterization of Columbia and NEC SX-8 superclusters. In: International Parallel and Distributed Processing Symposium (IPDPS), March 26-30, Long Beach, CA.

Thakur, R., Gropp, W., Lusk, E., 1999. Data sieving and collective I/O in ROMIO. In: Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation, February 1999. IEEE Computer Society, Washington, DC, pp. 182-189.

Von Bloh, W., Rost, S., Gerten, D., Lucht, W., 2010. Efficient parallelization of a dynamic global vegetation model with river routing. Environmental Modelling & Software 25, 685-690.

Yang, M., Koziol, Q., 2006. Using collective IO inside a high performance IO software package - HDF5. In: Proceedings of Teragrid 2006: Advancing Scientific Discovery, June 12-15, 2006, Indianapolis, IN.

Yu, D., 2010. Parallelization of a two-dimensional flood inundation model based on domain decomposition. Environmental Modelling & Software 25, 935-945.

Yu, H., Sahoo, K., Howson, C., Almási, G., Castaños, J.G., Gupta, M., Moreira, J.E., Parker, J.J., Engelsiepen, T.E., Ross, R.B., Thakur, R., Latham, R., Gropp, W.D., February 2006. High performance I/O for the BlueGene/L supercomputer. In: Proceedings of the 12th International Symposium on High-Performance Computer Architecture.