

An Efficient and Flexible Parallel I/O Implementation for the CFD General Notation System

Kyle Horne, Nate Benson
Center for High Performance Computing, Utah State University

Thomas Hauser
Academic & Research Computing, Northwestern University

[email protected], [email protected], [email protected]

Abstract—One important, often overlooked, issue for large, three-dimensional, time-dependent computational fluid dynamics (CFD) simulations is the input and output performance of the CFD solver, especially for large time-dependent simulations. The development of the CFD General Notation System (CGNS) has brought a standardized and robust data format to the CFD community, enabling the exchange of information between the various stages of numerical simulations. Application of this standard data format to large parallel simulations is hindered by the reliance of most applications on the CGNS Mid-Level Library, which until now has supported only serial I/O. By moving to HDF5 as the recommended low-level data storage format, the CGNS standards committee has created the opportunity to support parallel I/O. In our work, we present a parallel implementation of the CGNS Mid-Level Library and an I/O request-queuing approach to overcome some limitations of HDF5. We also present detailed benchmarks on a parallel file system for typical structured and unstructured CFD application I/O patterns.

    I. INTRODUCTION

Linux clusters can provide a viable and more cost-effective alternative to conventional supercomputers for the purposes of computational fluid dynamics (CFD). In some cases, the Linux supercluster is replacing the conventional supercomputer as a large-scale, shared-use machine. In other cases, smaller clusters are providing dedicated platforms for CFD computations. One important, often overlooked, issue for large, three-dimensional, time-dependent simulations is the input and output performance of the CFD solver. The development of the CFD General Notation System (CGNS) (see [1], [2], [3]) has brought a standardized and robust data format to the CFD community, enabling the exchange of information between the various stages of numerical simulations. CGNS is used in many commercial and open-source tools such as Fluent, Fieldview, Plot3D, OpenFOAM, Gmsh and VisIt. Unfortunately, the original design and implementation of the CGNS interface did not consider parallel applications and, therefore, lack parallel access mechanisms. By moving to HDF5 as the recommended low-level data storage format, the CGNS standards committee has created the opportunity to support parallel I/O. In our work, we present the parallel implementation of the CGNS Mid-Level Library and detailed benchmarks on parallel file systems for typical structured and unstructured CFD applications.

II. I/O FOR COMPUTATIONAL FLUID DYNAMICS SIMULATIONS

Current unsteady CFD simulations demand not only a great deal of processing power, but also large amounts of data storage. Even a relatively simple unsteady CFD simulation can produce terabytes of information at a tremendous rate, and all of this data must be stored and archived for analysis and post-processing. Most CFD programs implement one of the following three approaches:

1) One process is responsible for I/O. This means that all I/O activity is channeled through one process, and the program cannot take advantage of a parallel file system since this process is inherently serial. One CFD code using this approach is OVERFLOW [4], developed by NASA.

2) A large number of parallel CFD codes are programmed to access their input and output data in a one-file-per-process model. This approach is simple but has two major disadvantages. First, in order to get access to the full solution for visualization or post-processing, one must reassemble all files into one solution file. This is an inherently serial process and can take a long time. Second, the simulation program may only be restarted with the same number of processes (see Fig. 1a), which may reduce the potential throughput for a user on a busy supercomputer. One CFD code which takes this approach is OpenFOAM [5].

3) The most transparent approach for the user is writing from all processes in parallel into one shared file. The MPI-2 standard [6] defines, in chapter nine, the MPI-I/O extension, which allows parallel access from memory to files. Parallel I/O is more difficult to implement and may perform more slowly than approach 2 when the I/O requests are not coordinated properly. An overview of the shared-file I/O model is provided in Fig. 1b; a minimal code sketch of this model follows below.
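The shared-file model of approach 3 reduces to every rank writing its own portion of one file at a rank-dependent offset. The following minimal C sketch illustrates the idea with plain MPI-IO; the file name, block size, and data layout are placeholders chosen for illustration and are not taken from any of the codes cited above.

    /* Minimal sketch of the shared-file model: each rank writes a disjoint
     * region of one file with a collective MPI-IO call. File name and block
     * size are illustrative placeholders. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int n = 1 << 20;                 /* doubles owned by this rank */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *buf = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) buf[i] = (double)rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "solution.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Rank-dependent offset keeps the regions disjoint in the shared file. */
        MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }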

It is highly desirable to develop a set of parallel APIs for accessing CGNS files that employ appropriate parallel I/O techniques. Programming convenience and support for all types of CFD approaches, from block-structured to unstructured, is a very important design criterion, since scientific users may desire to spend minimal effort on dealing with I/O operations. This new parallel I/O API, together with a queuing approach to overcome some limitations of HDF5 for a generally distributed multi-block grid, and the benchmarks on the parallel file system form the core of our storage challenge entry.

    A. The CGNS system

The specific purpose of the CFD General Notation System (CGNS) project is to provide a standard for recording and recovering computer data associated with the numerical solution of the equations of fluid dynamics. The intent is to facilitate the exchange of computational fluid dynamics (CFD) data between sites, between application codes, and across computing platforms, and to stabilize the archiving of CFD data.

The CGNS system consists of three parts: (1) the Standard Interface Data Structures (SIDS), (2) the Mid-Level Library (MLL), and (3) the low-level storage system, currently using ADF and HDF5.

The "Standard Interface Data Structures" specification constitutes the essence of the CGNS system. While the other elements of the system deal with software implementation issues, the SIDS specification defines the substance of CGNS. It precisely defines the intellectual content of CFD-related data, including the organizational structure supporting such data and the conventions adopted to standardize the data exchange process. The SIDS are designed to support all types of information involved in CFD analysis. While the initial target was to establish a standard for 3D structured multi-block compressible Navier-Stokes analysis, the SIDS extensible framework now includes unstructured analysis, configurations, hybrid topology and geometry-to-mesh association.

A CGNS file consists of nodes with names, associated meta data and data. Figure 2 shows the abstract layout of a CGNS file. The nodes of a CGNS tree may or may not be in the same file.


Figure 1: Storage topology of a modern computer cluster in two configurations. Configuration (a) shows the traditional CFD approach to I/O, where each node writes a separate file to disk. These files must be combined to obtain the full solution in a process which can be quite time consuming. Configuration (b) shows the object of this proposal, a CFD I/O method in which the partial solution from each node is written into the appropriate location within a single solution file for the whole problem.

CGNS provides the concept of links (similar to soft links in a file system) between nodes. This enables users to write the grid coordinates in one file and then add links in the grid file to solutions written in several other files. To a visualization system, these linked files look like one big file containing the grid and solution data.
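As an illustration of the linking mechanism, the sketch below uses the serial Mid-Level Library to attach a link node in a grid file that points at a flow solution stored in a second file. The file names and node paths are hypothetical; only the MLL calls themselves (cg_open, cg_goto, cg_link_write, cg_close) are standard.

    /* Hedged sketch: link an externally stored solution into a grid file with
     * the serial CGNS Mid-Level Library. File names and node paths are made up. */
    #include "cgnslib.h"

    void link_solution_into_grid_file(void)
    {
        int fn, B = 1, Z = 1;

        /* Open the existing grid file for modification. */
        if (cg_open("grid.cgns", CG_MODE_MODIFY, &fn)) cg_error_exit();

        /* Navigate to the zone that should reference the external solution. */
        if (cg_goto(fn, B, "Zone_t", Z, "end")) cg_error_exit();

        /* Create the link node; tools following it see one combined tree. */
        if (cg_link_write("FlowSolution", "solution_0001.cgns",
                          "/Base/Zone 1/FlowSolution")) cg_error_exit();

        cg_close(fn);
    }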

Although the SIDS specification is independent of the physical file format, its design was targeted towards an implementation using the ADF Core library, but one is able to define a mapping to any other data storage format. Currently, in CGNS 3.0, HDF5 is the recommended storage format, but access to the older ADF format is transparent to users of the MLL.

1) ADF data format: The "Advanced Data Format" (ADF) is a concept defining how data is organized on the storage media. It is based on a single data structure called an ADF node, designed to store any type of data. Each ADF file is composed of at least one node called the "root". The ADF nodes follow a hierarchical arrangement from the root node down. This data format was the main format up to the 3.0 release.

Figure 2: Typical layout of a CGNS file, consisting of a tree structure with meta data and data stored under each tree node. The dashed link shows that CGNS also supports the concept of links to other nodes which reside in other files.

2) HDF5 data format: In the current beta version of CGNS 3.0, HDF5 is the default low-level storage format. The format of an HDF5 file on disk encompasses several key ideas of the HDF4 and AIO file formats, as well as addressing some shortcomings therein.

The HDF5 library uses these low-level objects to represent the higher-level objects that are then presented to the user or to applications through APIs. For instance, a group is an object header that contains a message that points to a local heap and to a B-tree which points to symbol nodes. A data set is an object header that contains messages that describe data type, space, layout, filters, external files, fill value, etc., with the layout message pointing either to a raw data chunk or to a B-tree that points to raw data chunks.

    III. PARALLEL I/O FOR THE CGNS SYSTEM

To facilitate convenient and high-performance parallel access to CGNS files, we have defined a new parallel interface and provide a prototype implementation. Since a large number of existing users are running their applications over CGNS, our parallel CGNS design retains the original MLL API and introduces extensions which are minimal changes from the original API. Currently our pCGNS library is intended as a companion library to the CGNS Mid-Level Library, since pCGNS only supports the writing of grids and data arrays. The much smaller meta data should still be written by a single process using the standard CGNS library. The parallel API is distinguished from the original serial API by prefixing the C function calls with "cgp_" instead of "cg_" as in the MLL API. Tabulated in the appendix are the functions defined in pcgnslib.h which constitute the pCGNS API as it currently exists. The functions chosen to be implemented in the initial version of the library were selected on the basis of two criteria. The first was the need for the function to provide the most basic functionality of the library. The second was the possible benefit from parallelization of the function. The result of using these priorities is that the pCGNS library can write the basic components of a CGNS file needed for the standard CGNS tools to function, but still provides routines for parallel file access on the largest of the data sets that need to be written to the file.

The CGNS Mid-Level Library supports writing files in HDF5 as of version 2.5, but the support is not native. The main library calls are written to the ADF interface, and HDF5 support is achieved through an ADF-to-HDF5 compatibility layer. While this solution works well for serial operation, since ADF has no support for parallel I/O, there is no way to access HDF5's parallel abilities through this compatibility layer. Therefore, the decision was made that the new parallel routines would be written directly to HDF5, with no connections to ADF.

A limited prototype implementation was done by Hauser and Pakalapati [7], [8]. In an effort to improve performance and better integrate the parallel extension into the CGNS Mid-Level Library, new routines are being written which access HDF5 directly and take full advantage of the collective I/O support in HDF5 1.8. Much of the work is to provide the basic structures in the file required for the standard CGNS library to be able to read files created by the new extension routines, but the most important parts are in writing large data sets describing CFD grids and solutions on those grids in parallel. This is done with functions which queue the I/O until a flush function is called, at which point the write commands are analyzed and executed with HDF5 calls. This approach provides MPI-IO with enough data to recognize the collective and contiguous nature of the data being written.
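The HDF5 1.8 mechanism that such a flush ultimately relies on is a collective hyperslab write: every rank selects its own disjoint slab of one shared dataset and passes a collective transfer property to H5Dwrite, so that MPI-IO can aggregate the requests. The sketch below shows that mechanism in isolation; it is not the pCGNS source, and the file and dataset names are arbitrary.

    /* Collective write of one shared 1-D dataset with parallel HDF5.
     * Illustrative of the mechanism only; names are arbitrary. */
    #include <hdf5.h>
    #include <mpi.h>

    void collective_write(MPI_Comm comm, hsize_t n_global,
                          hsize_t offset, hsize_t n_local, const double *buf)
    {
        /* Route file access through MPI-IO. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
        hid_t file = H5Fcreate("zone.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* One dataset sized for the whole array. */
        hid_t filespace = H5Screate_simple(1, &n_global, NULL);
        hid_t dset = H5Dcreate2(file, "CoordinateX", H5T_NATIVE_DOUBLE,
                                filespace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Each rank selects its own disjoint slab of the file dataspace. */
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL,
                            &n_local, NULL);
        hid_t memspace = H5Screate_simple(1, &n_local, NULL);

        /* Collective transfer lets MPI-IO coalesce the per-rank requests. */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

        H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
        H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    }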

The parallel capabilities of HDF5 allow a single array to be accessed by multiple processes simultaneously. However, it is currently not possible for a single process to access multiple arrays simultaneously. This prevents the parallel I/O implementation in pCGNS from directly handling the most generic case of partition and zone allocation.

To resolve this deficiency, the pCGNS library implements an I/O queue. An application can queue I/O operations and then flush them later. This allows the flush routine to optimize the operations for maximum bandwidth. Currently, the flush routine simply executes the I/O operations in the order they were added to the queue, but based on the results of benchmarking, the flush routine will be modified to limit the requests which cause multiple processes to access a single array, since that operation is slower than the others.

    IV. RELATED WORK

Considerable research has been done on data access for scientific applications. The work has focused on data I/O performance and data management convenience. Three projects, MPI-IO, HDF5 and parallel netCDF (PnetCDF), are closely related to this research.

MPI-IO is a parallel I/O interface specified in the MPI-2 standard. It is implemented and used on a wide range of platforms. The most popular implementation, ROMIO [9], is implemented portably on top of an abstract I/O device layer [10], [11] that enables portability to new underlying I/O systems. One of the most important features in ROMIO is collective I/O operations, which adopt a two-phase I/O strategy [12], [13], [14], [15] and improve parallel I/O performance by significantly reducing the number of I/O requests that would otherwise result in many small, noncontiguous I/O requests. However, MPI-IO reads and writes data in a raw format without providing any functionality to effectively manage the associated meta data, nor does it guarantee data portability, thereby making it inconvenient for scientists to organize, transfer, and share their application data.

HDF is a file format and software library, developed at NCSA, for storing, retrieving, analyzing, visualizing, and converting scientific data. The most popular versions of HDF are HDF4 [16] and HDF5 [17]. Both versions store multidimensional arrays together with ancillary data in portable, self-describing file formats. HDF4 was designed with serial data access in mind, whereas HDF5 is a major revision in which the API is completely redesigned and now includes parallel I/O access. The support for parallel data access in HDF5 is built on top of MPI-IO, which ensures its portability. This move undoubtedly inconvenienced users of HDF4, but it was a necessary step in providing parallel access semantics. HDF5 also adds several new features, such as a hierarchical file structure, that provide application programmers with a host of options for organizing how data is stored in HDF5 files. Unfortunately, this high degree of flexibility can sometimes come at the cost of performance, as seen in previous studies [18], [19].

Parallel netCDF (PnetCDF) [20] is a library that provides high-performance I/O while still maintaining file-format compatibility with Unidata's netCDF [21]. In the parallel implementation, the serial netCDF interface was extended to facilitate parallel access. By building on top of MPI-IO, a number of interface advantages and performance optimizations were obtained. Preliminary test results show that the somewhat simpler netCDF file format, coupled with a parallel API, provides a very high-performance solution to the problem of portable, structured data storage.

V. BENCHMARKING SYSTEMS - ARCHITECTURAL DETAILS

The HPC@USU infrastructure consists of several clusters connected through a ProCurve 5412zl data center switch to shared NFS and parallel storage. The entire configuration is shown in the appendix. The details of the compute and storage solutions are described in the following subsections.

Our Wasatch cluster consists of 64 compute nodes. Each compute node has two quad-core AMD Opteron 2376 processors running at 2.3 GHz and 16 GByte of main memory. The cluster has two interconnects: a DDR InfiniBand network, which connects all nodes and carries the traffic of parallel simulation codes, and a network consisting of two bonded GBit links, which connects the cluster directly to the parallel storage system.


All clusters at Utah State University's Center for High Performance Computing have access to the same network-attached storage (NAS) so that researchers can freely utilize all of the available computing resources. To accomplish this, the large ProCurve data center switch is the core switch connecting all storage and administrative connections for the clusters. These connections are independent of the message-passing fabric of the clusters, which is unique to each machine. The nodes of Wasatch, the newest addition to our center, are all connected to the core switch directly and thus enjoy the highest-speed access to the NAS.

The NAS system consists of four shelves of the Panasas ActiveScale Storage Cluster with 20 TB each, for a total raw capacity of 80 TB and a usable capacity of about 60 TB. Panasas storage devices are network-attached object-based storage devices (OSDs). Each OSD contains two Serial ATA drives and sufficient intelligence to manage the data that is locally stored. Meta data is managed by a meta data server, a computing node separate from the OSDs but residing on the same physical network. Clients communicate directly with the OSDs to access the stored data. Because the OSD manages all low-level layout issues, there is no need for a file server to intermediate every data transaction. To provide data redundancy and increase I/O throughput, the Panasas ActiveScale File System (PanFS) stripes the data of a file across a number of objects (one per OSD). When PanFS stripes a file across 10 OSDs attached to a Gigabit Ethernet (GE) network, PanFS can deliver up to 400 MB/s split among the clients accessing the file.

Processes on our compute nodes can invoke POSIX read/write functions directly or go through MPI-I/O or parallel HDF5, which is built on top of MPI-I/O. For all of our benchmarks we use OpenMPI 1.3.3 with parallel I/O for PanFS enabled. The parallel extension to the CGNS library, called pCGNS, uses the HDF5 library for all file access. HDF5 in turn uses MPI-I/O for all parallel access to files. This abstraction of the file system into layers of libraries is crucial if a scientific code is to be run at many sites with different hardware and software configurations. In addition to these libraries, custom test programs have been developed to issue CFD-like I/O requests to test the performance of the pCGNS library.

VI. PARALLEL I/O CHARACTERIZATION

We ran two benchmarks to test the performance of our implementation on current parallel file systems: IOR [22] and a self-developed parallel CGNS benchmark program which covers most cases of parallel I/O for a structured or unstructured CFD code. All benchmarks were run five times, and the averages of these runs, together with the standard deviation of the results, are plotted in the figures in this section.

A. IOR benchmark results

The IOR benchmarks were run with parameters to test the performance achievable on our benchmarking system with HDF5. In Fig. 3 the results for read and write are shown. IOR was configured such that each process accesses 512 MB in a single shared file on disk using the HDF5 interface. These operations were done using independent I/O to match the settings used to benchmark pCGNS.

B. Benchmarks of the parallel CGNS implementation

To test the performance of the new library, benchmark test programs have been written which make CFD-like I/O requests to the library and measure how long each task takes to complete. Five categories of benchmark were selected to be run on the pCGNS library. The data distribution for each of the benchmark categories is shown in Fig. 4.

Each scenario of data distribution models a general method of splitting up data among the processes of a parallel program. The scenarios will hereafter be referred to by the designations type-a through type-e.

Figure 3: IOR read and write bandwidth using the HDF5 interface. (a) Data for the entire cluster, scaling up to 8 processes per node.


1) I/O performance for a single node with multiple zones: Running multiple processes on a node forces those processes to share a single network connection, thus affecting the measurement of the parallel performance of the library. In the type-a benchmark, however, this limitation is desired, since it allows the single process with multiple zones to have access to the same storage bandwidth as the aggregate bandwidth of multiple processes on the node. The results are shown in Fig. 5.

2) I/O performance for one non-partitioned zone per process: The type-b benchmarks measure the performance of pCGNS in a configuration very close to the benchmarks run by IOR. In this scenario, each node runs a single process, and each process writes only a single zone. Because each zone has its own data array in the file, and each zone is written by a single process, each process writes to a single data array in the file. This benchmark resembles a structured or unstructured multi-block code where each process owns a single zone.

3) I/O performance for multiple non-partitioned zones per process: The type-c benchmarks test the ability of the library to write multiple zones from each node in parallel. Because the file size is kept constant, more zones per node result in smaller data arrays written to the file in each I/O operation. Since this is expected to affect performance, multiple cases are run with different numbers of zones per node. This benchmark resembles a structured or unstructured multi-block code where each process owns several zones.

4) I/O performance for one partitioned zone per process with multiple zones: The type-d benchmarks test the ability of the library to write a single zone from multiple nodes. This causes multiple processes to access a single data array simultaneously, which is quite different from multiple processes accessing multiple data arrays. In this scenario, each node writes to only a single zone, but each zone is written to by multiple processes. This is analogous to a CFD code in which each process must share a single zone with another process.

5) I/O performance for multiple partitioned zones per process with multiple zones: The last scenario, type-e, tests the most general access pattern, in which multiple processes access each zone and multiple zones are accessed by each process. This access pattern would result from the zone definitions and the parallel partitioning process being completely independent of one another. Thus, this is the most general of the benchmarks and resembles a CFD code in which each process must share several zones with other processes.

    C. Discussion

It can be observed from the benchmarks of pCGNS that write performance is very dependent on the amount of data being transferred. Using larger data sets results in significantly improved performance when writing to the file, but does not affect the read performance. This phenomenon can be explained by the extra operations that must be executed when writing to the file, such as disk allocation for the new data. For larger data sets, this overhead is much smaller compared to the time required to transfer the data itself, which results in better performance. Read operations have no need for disk allocation and are not affected. This also explains the performance seen in the type-c benchmarks, since the addition of more zones for each node results in more disk allocations.

The optimal performance of the pCGNS library is seen in the type-b benchmark. When running on 24 nodes, the code writes the data at more than 450 MB/s, which compares very favorably with the 500 MB/s seen in the IOR benchmark for the same number of processes and nodes. The difference between pCGNS and IOR is not large and is likely caused by better organization of the I/O and disk allocation in the IOR benchmark, since pCGNS does not know ahead of time what the I/O and data will look like. The lower performance of pCGNS at larger numbers of nodes is likely due to the decreasing size of the data per process, whereas IOR keeps the data per process constant.

Figure 5: Multiple zones on a single node with up to eight cores. (a) Read bandwidth. (b) Write bandwidth.

Figure 6: I/O performance for one non-partitioned zone per process. (a) Read bandwidth. (b) Write bandwidth.

The read performance of pCGNS does not compare as favorably with the IOR benchmark as the write performance. The peak read performance of pCGNS is seen with 16 nodes, where it achieves a read bandwidth of ~1100 MB/s, whereas IOR sees ~1600 MB/s. After this point, pCGNS's performance drops off while IOR's continues to improve, up to a peak of 2500 MB/s. The post-peak behavior is again due to the increasing size of data with IOR and the constant data size with pCGNS. The discrepancy at lower node counts is more indicative of the performance difference between IOR and pCGNS. This difference may be caused by the fact that IOR runs a single read request against the storage, whereas the pCGNS benchmark runs seven. With data rates so high compared to the size of the files used, the overhead of the added requests issued by pCGNS may be responsible for the performance drop.



Figure 11: I/O performance for two partitioned zones per process. (a) Read bandwidth. (b) Write bandwidth.

REFERENCES

[5] Weller, H., Tabor, G., Jasak, H., and Fureby, C., "A tensorial approach to computational continuum mechanics using object-oriented techniques," Computers in Physics, Vol. 12, No. 6, 1998, pp. 620–631.

[6] Message Passing Interface Forum, "MPI-2: Extensions to the Message-Passing Interface," 1996.

[7] Hauser, T., "Parallel I/O for the CGNS system," American Institute of Aeronautics and Astronautics, No. 1090, 2004.

[8] Pakalapati, P. D. and Hauser, T., "Benchmarking Parallel I/O Performance for Computational Fluid Dynamics Applications," AIAA Aerospace Sciences Meeting and Exhibit, Vol. 43, 2005.

[9] Thakur, R., Ross, R., Lusk, E., and Gropp, W., "Users Guide for ROMIO: A High-Performance, Portable MPI-IO Implementation," Technical Memorandum No. 234, Mathematics and Computer Science Division, Argonne National Laboratory, revised January 2002.

[10] Thakur, R., Gropp, W., and Lusk, E., "An Abstract-Device Interface for Implementing Portable Parallel-I/O Interfaces (ADIO)," Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation, October 1996, pp. 180–187.

[11] Thakur, R., Gropp, W., and Lusk, E., "On Implementing MPI-I/O Portably and with High Performance," Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, May 1999, pp. 23–32.

[12] Rosario, J., Bordawekar, R., and Choudhary, A., "Improved Parallel I/O via a Two-Phase Run-time Access Strategy," IPPS '93 Parallel I/O Workshop, February 1993.

[13] Thakur, R., Bordawekar, R., Choudhary, A., and Ponnusamy, R., "PASSION Runtime Library for Parallel I/O," Scalable Parallel Libraries Conference, October 1994.

[14] Thakur, R. and Choudhary, A., "An Extended Two-Phase Method for Accessing Sections of Out-of-Core Arrays," Scientific Programming, Vol. 5, No. 4, 1996, pp. 301–317.

[15] Thakur, R., Gropp, W., and Lusk, E., "Data Sieving and Collective I/O in ROMIO," Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation, February 1999, pp. 182–189.

[16] "HDF4 Home Page," The National Center for Supercomputing Applications, http://hdf.ncsa.uiuc.edu/hdf4.html.

[17] "HDF5 Home Page," The National Center for Supercomputing Applications, http://hdf.ncsa.uiuc.edu/HDF5/.

[18] Li, J., Liao, W., Choudhary, A., and Taylor, V., "I/O Analysis and Optimization for an AMR Cosmology Application," Proceedings of IEEE Cluster 2002, Chicago, IL, September 2002.

[19] Ross, R., Nurmi, D., Cheng, A., and Zingale, M., "A Case Study in Application I/O on Linux Clusters," Proceedings of SC2001, Denver, CO, November 2001.

[20] Li, J., Liao, W.-k., Choudhary, A., Ross, R., Thakur, R., Gropp, W., Latham, R., and Siegel, A., "Parallel netCDF: A High-Performance Scientific I/O Interface," Proceedings of the Supercomputing 2003 Conference, Phoenix, AZ, November 2003.

[21] Rew, R. and Davis, G., "The Unidata netCDF: Software for Scientific Data Access," Sixth International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography and Hydrology, Anaheim, CA, February 1990.

[22] "IOR HPC Benchmark," SourceForge, http://sourceforge.net/projects/ior-sio/.


    APPENDIX

Figure 12: Example CGNS file with the zone being the node containing grid and solution information. The root node holds CGNSLibraryVersion_t and CGNSBase_t; beneath the base are the zones (Zone_t) together with ConvergenceHistory_t and ReferenceState_t, and each zone contains GridCoordinates_t (CoordinateX and CoordinateY as DataArray_t), Elements_t (ElementConnectivity), FlowSolution_t (GridLocation_t, with Density and Pressure as DataArray_t), ZoneBC_t and ZoneGridConnectivity_t.

Table I: API of the pCGNS library as implemented in pcgnslib.c and declared in pcgnslib.h. Software using the library to access CGNS files in parallel must use the routines listed here.

General File Operations
  cgp_open                 Open a new file in parallel
  cgp_base_read            Read the details of a base in the file
  cgp_base_write           Write a new base to the file
  cgp_nbases               Return the number of bases in the file
  cgp_zone_read            Read the details of a zone in the base
  cgp_zone_type            Read the type of a zone in the base
  cgp_zone_write           Write a zone to the base
  cgp_nzones               Return the number of zones in the base

Coordinate Data Operations
  cgp_coord_write          Create the node and empty array to store coordinate data
  cgp_coord_write_data     Write coordinate data to the zone in parallel

Unstructured Grid Connectivity Operations
  cgp_section_write        Create the nodes and empty array to store grid connectivity for an unstructured mesh
  cgp_section_write_data   Write the grid connectivity to the zone in parallel for an unstructured mesh

Solution Data Operations
  cgp_sol_write            Create the node and empty array to store solution data
  cgp_sol_write_data       Write solution data to the zone in parallel

General Array Operations
  cgp_array_write          Create the node and empty array to store general array data
  cgp_array_write_data     Write general array data to the zone in parallel

Queued I/O Operations
  queue_slice_write        Queue a write operation to be executed later
  queue_flush              Execute queued write operations

Figure 13: Topology of the clusters' network-attached storage at Utah State University. The older cluster Uinta is connected to the root switch through either one or two gigabit switches, whereas all nodes in the new cluster Wasatch are connected directly to the root switch. All the storage is connected directly to the root switch, including a parallel storage solution from Panasas mounted on /panfs.
