

An Efficient and Flexible Parallel I/O Implementation for the CFD General Notation System

Kyle Horne, Nate Benson
Center for High Performance Computing, Utah State University

Thomas Hauser
Academic & Research Computing, Northwestern University

[email protected], [email protected], [email protected]

Abstract—One important, often overlooked, issue for large, three-dimensional, time-dependent computational fluid dynamics (CFD) simulations is the input and output performance of the CFD solver, especially for large time-dependent simulations. The development of the CFD General Notation System (CGNS) has brought a standardized and robust data format to the CFD community, enabling the exchange of information between the various stages of numerical simulations. Application of this standard data format to large parallel simulations is hindered by the reliance of most applications on the CGNS Mid-Level Library, which until now has supported only serial I/O. By moving to HDF5 as the recommended low-level data storage format, the CGNS standards committee has created the opportunity to support parallel I/O. In our work, we present a parallel implementation of the CGNS Mid-Level Library and an I/O request-queuing approach to overcome some limitations of HDF5. We also present detailed benchmarks on a parallel file system for typical structured and unstructured CFD application I/O patterns.

    I. INTRODUCTION

Linux clusters can provide a viable and more cost-effective alternative to conventional supercomputers for the purposes of computational fluid dynamics (CFD). In some cases, the Linux supercluster is replacing the conventional supercomputer as a large-scale, shared-use machine. In other cases, smaller clusters are providing dedicated platforms for CFD computations. One important, often overlooked, issue for large, three-dimensional, time-dependent simulations is the input and output performance of the CFD solver. The development of the CFD General Notation System (CGNS) (see [1], [2], [3]) has brought a standardized and robust data format to the CFD community, enabling the exchange of information between the various stages of numerical simulations. CGNS is used in many commercial and open-source tools such as Fluent, Fieldview, Plot3D, OpenFOAM, Gmsh and VisIt. Unfortunately, the original design and implementation of the CGNS interface did not consider parallel applications and, therefore, lack parallel access mechanisms. By moving to HDF5 as the recommended low-level data storage format, the CGNS standards committee has created the opportunity to support parallel I/O. In our work, we present the parallel implementation of the CGNS Mid-Level Library and detailed benchmarks on parallel file systems for typical structured and unstructured CFD applications.

II. I/O FOR COMPUTATIONAL FLUID DYNAMICS SIMULATIONS

Current unsteady CFD simulations demand not only a great deal of processing power, but also large amounts of data storage. Even a relatively simple unsteady CFD simulation can produce terabytes of information at a tremendous rate, and all of this data must be stored and archived for analysis and post-processing. Most CFD programs implement one of the following three approaches:

1) One process is responsible for I/O. This means that all I/O activity is channeled through one process, and the program cannot take advantage of a parallel file system since this process is inherently serial. One CFD code using this approach is OVERFLOW [4], developed by NASA.

2) A large number of parallel CFD codes are programmed to access their input and output data in a one-file-per-process model. This approach is simple but has two major disadvantages. First, in order to get access to the full solution for visualization or post-processing, one must reassemble all files into one solution file. This is an inherently serial process and can take a long time. Second, the simulation program may only be restarted with the same number of processes (see Fig. 1a), which may reduce the potential throughput for a user on a busy supercomputer. One CFD code which takes this approach is OpenFOAM [5].

3) The most transparent approach for the user is writing from all processes in parallel into one shared file. The MPI-2 standard [6] defines, in chapter nine, the MPI-I/O extension, which allows parallel access from memory to files. Parallel I/O is more difficult to implement and may perform more slowly than approach 2 when the I/O requests are not coordinated properly. An overview of the shared-file I/O model is provided in Fig. 1b; a minimal code sketch of this model follows below.
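The shared-file model of approach 3 reduces to every rank writing its own portion of one file at a rank-dependent offset. The following minimal C sketch illustrates the idea with plain MPI-IO; the file name, block size, and data layout are placeholders chosen for illustration and are not taken from any of the codes cited above.

    /* Minimal sketch of the shared-file model: each rank writes a disjoint
     * region of one file with a collective MPI-IO call. File name and block
     * size are illustrative placeholders. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int n = 1 << 20;                 /* doubles owned by this rank */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *buf = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) buf[i] = (double)rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "solution.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Rank-dependent offset keeps the regions disjoint in the shared file. */
        MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }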

It is highly desirable to develop a set of parallel APIs for accessing CGNS files that employ appropriate parallel I/O techniques. Programming convenience and support for all types of CFD approaches, from block-structured to unstructured, is a very important design criterion, since scientific users may desire to spend minimal effort on dealing with I/O operations. This new parallel I/O API, together with a queuing approach to overcome some limitations of HDF5 for a generally distributed multi-block grid, and the benchmarks on the parallel file system form the core of our storage challenge entry.

    A. The CGNS system

The specific purpose of the CFD General Notation System (CGNS) project is to provide a standard for recording and recovering computer data associated with the numerical solution of the equations of fluid dynamics. The intent is to facilitate the exchange of computational fluid dynamics (CFD) data between sites, between application codes, and across computing platforms, and to stabilize the archiving of CFD data.

The CGNS system consists of three parts: (1) the Standard Interface Data Structures (SIDS), (2) the Mid-Level Library (MLL), and (3) the low-level storage system, currently using ADF and HDF5.

The "Standard Interface Data Structures" specification constitutes the essence of the CGNS system. While the other elements of the system deal with software implementation issues, the SIDS specification defines the substance of CGNS. It precisely defines the intellectual content of CFD-related data, including the organizational structure supporting such data and the conventions adopted to standardize the data exchange process. The SIDS are designed to support all types of information involved in CFD analysis. While the initial target was to establish a standard for 3D structured multi-block compressible Navier-Stokes analysis, the SIDS extensible framework now includes unstructured analysis, configurations, hybrid topology and geometry-to-mesh association.

A CGNS file consists of nodes with names, associated meta data and data. Figure 2 shows the abstract layout of a CGNS file. The nodes of a CGNS tree may or may not be in the same file.


Figure 1: Storage topology of a modern computer cluster in two configurations. Configuration (a) shows the traditional CFD approach to I/O, where each node writes a separate file to disk. These files must be combined to obtain the full solution in a process which can be quite time consuming. Configuration (b) shows the object of this proposal, a CFD I/O method in which the partial solution from each node is written into the appropriate location within a single solution file for the whole problem.

CGNS provides the concept of links (similar to soft links in a file system) between nodes. This enables users to write the grid coordinates in one file and then add links in the grid file to solutions written in several other files. To a visualization system, these linked files look like one big file containing the grid and solution data.
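As an illustration of the linking mechanism, the sketch below uses the serial Mid-Level Library to attach a link node in a grid file that points at a flow solution stored in a second file. The file names and node paths are hypothetical; only the MLL calls themselves (cg_open, cg_goto, cg_link_write, cg_close) are standard.

    /* Hedged sketch: link an externally stored solution into a grid file with
     * the serial CGNS Mid-Level Library. File names and node paths are made up. */
    #include "cgnslib.h"

    void link_solution_into_grid_file(void)
    {
        int fn, B = 1, Z = 1;

        /* Open the existing grid file for modification. */
        if (cg_open("grid.cgns", CG_MODE_MODIFY, &fn)) cg_error_exit();

        /* Navigate to the zone that should reference the external solution. */
        if (cg_goto(fn, B, "Zone_t", Z, "end")) cg_error_exit();

        /* Create the link node; tools following it see one combined tree. */
        if (cg_link_write("FlowSolution", "solution_0001.cgns",
                          "/Base/Zone 1/FlowSolution")) cg_error_exit();

        cg_close(fn);
    }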

Although the SIDS specification is independent of the physical file format, its design was targeted towards an implementation using the ADF Core library, but one is able to define a mapping to any other data storage format. Currently, in CGNS 3.0, HDF5 is the recommended storage format, but access to the older ADF format is transparent to users of the MLL.

1) ADF data format: The "Advanced Data Format" (ADF) is a concept defining how data is organized on the storage media. It is based on a single data structure called an ADF node, designed to store any type of data. Each ADF file is composed of at least one node called the "root". The ADF nodes follow a hierarchical arrangement from the root node down. This data format was the main format up to the 3.0 release.

Figure 2: Typical layout of a CGNS file, consisting of a tree structure with meta data and data stored under each tree node. The dashed link shows that CGNS also supports the concept of links to other nodes which reside in other files.

2) HDF5 data format: In the current beta version of CGNS 3.0, HDF5 is the default low-level storage format. The format of an HDF5 file on disk encompasses several key ideas of the HDF4 and AIO file formats, as well as addressing some shortcomings therein.

The HDF5 library uses these low-level objects to represent the higher-level objects that are then presented to the user or to applications through APIs. For instance, a group is an object header that contains a message that points to a local heap and to a B-tree which points to symbol nodes. A data set is an object header that contains messages that describe data type, space, layout, filters, external files, fill value, etc., with the layout message pointing either to a raw data chunk or to a B-tree that points to raw data chunks.

    III. PARALLEL I/O FOR THE CGNS SYSTEM

To facilitate convenient and high-performance parallel access to CGNS files, we have defined a new parallel interface and provide a prototype implementation. Since a large number of existing users are running their applications over CGNS, our parallel CGNS design retains the original MLL API and introduces extensions which are minimal changes from the original API. Currently our pCGNS library is intended as a companion library to the CGNS Mid-Level Library, since pCGNS only supports the writing of grids and data arrays. The much smaller meta data should still be written by a single process using the standard CGNS library. The parallel API is distinguished from the original serial API by prefixing the C function calls with "cgp_" instead of "cg_" as in the MLL API. Tabulated in the appendix are the functions defined in pcgnslib.h which constitute the pCGNS API as it currently exists. The functions chosen to be implemented in the initial version of the library were selected on the basis of two criteria. The first was the need for the function to provide the most basic functionality of the library. The second was the possible benefit from parallelization of the function. The result of using these priorities is that the pCGNS library can write the basic components of a CGNS file needed for the standard CGNS tools to function, but still provides routines for parallel file access on the largest of the data sets that need to be written to the file.

The CGNS Mid-Level Library supports writing files in HDF5 as of version 2.5, but the support is not native. The main library calls are written to the ADF interface, and HDF5 support is achieved through an ADF-to-HDF5 compatibility layer. While this solution works well for serial operation, since ADF has no support for parallel I/O, there is no way to access HDF5's parallel abilities through this compatibility layer. Therefore, the decision was made that the new parallel routines would be written directly to HDF5, with no connections to ADF.

A limited prototype implementation was done by Hauser and Pakalapati [7], [8]. In an effort to improve performance and better integrate the parallel extension into the CGNS Mid-Level Library, new routines are being written which access HDF5 directly and take full advantage of the collective I/O support in HDF5 1.8. Much of the work is to provide the basic structures in the file required for the standard CGNS library to be able to read files created by the new extension routines, but the most important parts are in writing large data sets describing CFD grids and solutions on those grids in parallel. This is done with functions which queue the I/O until a flush function is called, at which point the write commands are analyzed and executed with HDF5 calls. This approach provides MPI-IO with enough data to recognize the collective and contiguous nature of the data being written.
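The HDF5 1.8 mechanism that such a flush ultimately relies on is a collective hyperslab write: every rank selects its own disjoint slab of one shared dataset and passes a collective transfer property to H5Dwrite, so that MPI-IO can aggregate the requests. The sketch below shows that mechanism in isolation; it is not the pCGNS source, and the file and dataset names are arbitrary.

    /* Collective write of one shared 1-D dataset with parallel HDF5.
     * Illustrative of the mechanism only; names are arbitrary. */
    #include <hdf5.h>
    #include <mpi.h>

    void collective_write(MPI_Comm comm, hsize_t n_global,
                          hsize_t offset, hsize_t n_local, const double *buf)
    {
        /* Route file access through MPI-IO. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
        hid_t file = H5Fcreate("zone.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* One dataset sized for the whole array. */
        hid_t filespace = H5Screate_simple(1, &n_global, NULL);
        hid_t dset = H5Dcreate2(file, "CoordinateX", H5T_NATIVE_DOUBLE,
                                filespace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Each rank selects its own disjoint slab of the file dataspace. */
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL,
                            &n_local, NULL);
        hid_t memspace = H5Screate_simple(1, &n_local, NULL);

        /* Collective transfer lets MPI-IO coalesce the per-rank requests. */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

        H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
        H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    }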

The parallel capabilities of HDF5 allow a single array to be accessed by multiple processes simultaneously. However, it is currently not possible for a single process to access multiple arrays simultaneously. This prevents the parallel I/O implementation in pCGNS from directly handling the most generic case of partition and zone allocation.

To resolve this deficiency, the pCGNS library implements an I/O queue. An application can queue I/O operations and then flush them later. This allows the flush routine to optimize the operations for maximum bandwidth. Currently, the flush routine simply executes the I/O operations in the order they were added to the queue, but based on the results of benchmarking, the flush routine will be modified to limit the requests which cause multiple processes to access a single array, since that operation is slower than the others.

    IV. RELATED WORK

Considerable research has been done on data access for scientific applications. The work has focused on data I/O performance and data management convenience. Three projects, MPI-IO, HDF5 and parallel netCDF (PnetCDF), are closely related to this research.

MPI-IO is a parallel I/O interface specified in the MPI-2 standard. It is implemented and used on a wide range of platforms. The most popular implementation, ROMIO [9], is implemented portably on top of an abstract I/O device layer [10], [11] that enables portability to new underlying I/O systems. One of the most important features in ROMIO is collective I/O operations, which adopt a two-phase I/O strategy [12], [13], [14], [15] and improve parallel I/O performance by significantly reducing the number of I/O requests that would otherwise result in many small, noncontiguous I/O requests. However, MPI-IO reads and writes data in a raw format without providing any functionality to effectively manage the associated meta data, nor does it guarantee data portability, thereby making it inconvenient for scientists to organize, transfer, and share their application data.

HDF is a file format and software library, developed at NCSA, for storing, retrieving, analyzing, visualizing, and converting scientific data. The most popular versions of HDF are HDF4 [16] and HDF5 [17]. Both versions store multidimensional arrays together with ancillary data in portable, self-describing file formats. HDF4 was designed with serial data access in mind, whereas HDF5 is a major revision in which the API is completely redesigned and now includes parallel I/O access. The support for parallel data access in HDF5 is built on top of MPI-IO, which ensures its portability. This move undoubtedly inconvenienced users of HDF4, but it was a necessary step in providing parallel access semantics. HDF5 also adds several new features, such as a hierarchical file structure, that provide application programmers with a host of options for organizing how data is stored in HDF5 files. Unfortunately, this high degree of flexibility can sometimes come at the cost of performance, as seen in previous studies [18], [19].

Parallel netCDF (PnetCDF) [20] is a library that provides high-performance I/O while still maintaining file-format compatibility with Unidata's netCDF [21]. In the parallel implementation, the serial netCDF interface was extended to facilitate parallel access. By building on top of MPI-IO, a number of interface advantages and performance optimizations were obtained. Preliminary test results show that the somewhat simpler netCDF file format, coupled with a parallel API, provides a very high-performance solution to the problem of portable, structured data storage.

V. BENCHMARKING SYSTEMS - ARCHITECTURAL DETAILS

The HPC@USU infrastructure consists of several clusters connected through a ProCurve 5412zl data center switch to shared NFS and parallel storage. The entire configuration is shown in the appendix. The details of the compute and storage solutions are described in the following subsections.

Our Wasatch cluster consists of 64 compute nodes. Each compute node has two quad-core AMD Opteron 2376 processors running at 2.3 GHz and 16 GByte of main memory. The cluster has two interconnects: a DDR InfiniBand network, which connects all nodes and carries the traffic of parallel simulation codes, and a network consisting of two bonded GBit links, which connects the cluster directly to the parallel storage system.


All clusters at Utah State University's Center for High Performance Computing have access to the same network-attached storage (NAS) so that researchers can freely utilize all of the available computing resources. To accomplish this, the large ProCurve data center switch is the core switch connecting all storage and administrative connections for the clusters. These connections are independent of the message-passing fabric of the clusters, which is unique to each machine. The nodes of Wasatch, the newest addition to our center, are all connected to the core switch directly and thus enjoy the highest-speed access to the NAS.

The NAS system consists of four shelves of the Panasas ActiveScale Storage Cluster with 20 TB each, for a total raw capacity of 80 TB and a usable capacity of about 60 TB. Panasas storage devices are network-attached object-based storage devices (OSDs). Each OSD contains two Serial ATA drives and sufficient intelligence to manage the data that is locally stored. Meta data is managed by a meta data server, a computing node separate from the OSDs but residing on the same physical network. Clients communicate directly with the OSDs to access the stored data. Because the OSD manages all low-level layout issues, there is no need for a file server to intermediate every data transaction. To provide data redundancy and increase I/O throughput, the Panasas ActiveScale File System (PanFS) stripes the data of a file across a number of objects (one per OSD). When PanFS stripes a file across 10 OSDs attached to a Gigabit Ethernet (GE) network, PanFS can deliver up to 400 MB/s split among the clients accessing the file.

Processes on our compute nodes can invoke POSIX read/write functions directly or go through MPI-I/O or parallel HDF5, which is built on top of MPI-I/O. For all of our benchmarks we use OpenMPI 1.3.3 with parallel I/O for PanFS enabled. The parallel extension to the CGNS library, called pCGNS, uses the HDF5 library for all file access. HDF5 in turn uses MPI-I/O for all parallel access to files. This abstraction of the file system into layers of libraries is crucial if a scientific code is to be run at many sites with different hardware and software configurations. In addition to these libraries, custom test programs have been developed to issue CFD-like I/O requests to test the performance of the pCGNS library.

VI. PARALLEL I/O CHARACTERIZATION

We ran two benchmarks to test the performance of our implementation on current parallel file systems: IOR [22] and a self-developed parallel CGNS benchmark program which covers most cases of parallel I/O for a structured or unstructured CFD code. All benchmarks were run five times, and the averages of these runs, together with the standard deviation of the results, are plotted in the figures in this section.

A. IOR benchmark results

The IOR benchmarks were run with parameters to test the performance achievable on our benchmarking system with HDF5. In Fig. 3 the results for read and write are shown. IOR was configured such that each process accesses 512 MB in a single shared file on disk using the HDF5 interface. These operations were done using independent I/O to match the settings used to benchmark pCGNS.

B. Benchmarks of the parallel CGNS implementation

To test the performance of the new library, benchmark test programs have been written which make CFD-like I/O requests to the library and measure how long each task takes to complete. Five categories of benchmark were selected to be run on the pCGNS library. The data distribution for each of the benchmark categories is shown in Fig. 4.

Each scenario of data distribution models a general method of splitting up data among the processes of a parallel program. The scenarios will hereafter be referred to by the designations type-a through type-e.

Figure 3: IOR read and write bandwidth using the HDF5 interface. (a) Data for the entire cluster, scaling up to 8 processes per node.


1) I/O performance for a single node with multiple zones: Running multiple processes on a node forces those processes to share a single network connection, thus affecting the measurement of the parallel performance of the library. In the type-a benchmark, however, this limitation is desired, since it allows the single process with multiple zones to have access to the same storage bandwidth as the aggregate bandwidth of multiple processes on the node. The results are shown in Fig. 5.

2) I/O performance for one non-partitioned zone per process: The type-b benchmarks measure the performance of pCGNS in a configuration very close to the benchmarks run by IOR. In this scenario, each node runs a single process, and each process writes only a single zone. Because each zone has its own data array in the file, and each zone is written by a single process, each process writes to a single data array in the file. This benchmark resembles a structured or unstructured multi-block code where each process owns a single zone.

3) I/O performance for multiple non-partitioned zones per process: The type-c benchmarks test the ability of the library to write multiple zones from each node in parallel. Because the file size is kept constant, more zones per node result in smaller data arrays written to the file in each I/O operation. Since this is expected to affect performance, multiple cases are run with different numbers of zones per node. This benchmark resembles a structured or unstructured multi-block code where each process owns several zones.

4) I/O performance for one partitioned zone per process with multiple zones: The type-d benchmarks test the ability of the library to write a single zone from multiple nodes. This causes multiple processes to access a single data array simultaneously, which is quite different from multiple processes accessing multiple data arrays. In this scenario, each node writes to only a single zone, but each zone is written to by multiple processes. This is analogous to a CFD code in which each process must share a single zone with another process.

5) I/O performance for multiple partitioned zones per process with multiple zones: The last scenario, type-e, tests the most general access pattern, in which multiple processes access each zone and multiple zones are accessed by each process. This access pattern would result from the zone definitions and the parallel partitioning process being completely independent of one another. Thus, this is the most general of the benchmarks and resembles a CFD code in which each process must share several zones with other processes.

    C. Discussion

It can be observed from the benchmarks of pCGNS that write performance is very dependent on the amount of data being transferred. Using larger data sets results in significantly improved performance when writing to the file, but does not affect the read performance. This phenomenon can be explained by the extra operations that must be executed when writing to the file, such as disk allocation for the new data. For larger data sets, this overhead is much smaller compared to the time required to transfer the data itself, which results in better performance. Read operations have no need for disk allocation and are not affected. This also explains the performance seen in the type-c benchmarks, since the addition of more zones for each node results in more disk allocations.

The optimal performance of the pCGNS library is seen in the type-b benchmark. When running on 24 nodes, the code writes the data at more than 450 MB/s, which compares very favorably with the 500 MB/s seen in the IOR benchmark for the same number of processes and nodes. The difference between pCGNS and IOR is not large and is likely caused by better organization of the I/O and disk allocation in the IOR benchmark, since pCGNS does not know ahead of time what the I/O and data will look like. The lower performance of pCGNS at larger numbers of nodes is likely due to the decreasing size of the data per process, whereas IOR keeps the data per process constant.

Figure 5: Multiple zones on a single node with up to eight cores. (a) Read bandwidth. (b) Write bandwidth.

Figure 6: I/O performance for one non-partitioned zone per process. (a) Read bandwidth. (b) Write bandwidth.

The read performance of pCGNS does not compare as favorably with the IOR benchmark as the write performance. The peak read performance of pCGNS is seen with 16 nodes, where it achieves a read bandwidth of ~1100 MB/s, whereas IOR sees ~1600 MB/s. After this point, pCGNS's performance drops off while IOR's continues to improve, up to a peak of 2500 MB/s. The post-peak behavior is again due to the increasing size of data with IOR and the constant data size with pCGNS. The discrepancy at lower node counts is more indicative of the performance difference between IOR and pCGNS. This difference may be caused by the fact that IOR runs a single read request against the storage, whereas the pCGNS benchmark runs seven. With data rates so high compared to the size of the files used, the overhead of the added requests issued by pCGNS may be responsible for the performance drop.



Figure 11: I/O performance for two partitioned zones per process. (a) Read bandwidth. (b) Write bandwidth.

REFERENCES

[5] Weller, H., Tabor, G., Jasak, H., and Fureby, C., "A tensorial approach to computational continuum mechanics using object-oriented techniques," Computers in Physics, Vol. 12, No. 6, 1998, pp. 620–631.

[6] Message Passing Interface Forum, "MPI-2: Extensions to the Message-Passing Interface," 1996.

[7] Hauser, T., "Parallel I/O for the CGNS system," American Institute of Aeronautics and Astronautics, No. 1090, 2004.

[8] Pakalapati, P. D. and Hauser, T., "Benchmarking Parallel I/O Performance for Computational Fluid Dynamics Applications," AIAA Aerospace Sciences Meeting and Exhibit, Vol. 43, 2005.

[9] Thakur, R., Ross, R., Lusk, E., and Gropp, W., "Users Guide for ROMIO: A High-Performance, Portable MPI-IO Implementation," Technical Memorandum No. 234, Mathematics and Computer Science Division, Argonne National Laboratory, revised January 2002.

[10] Thakur, R., Gropp, W., and Lusk, E., "An Abstract-Device Interface for Implementing Portable Parallel-I/O Interfaces (ADIO)," Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation, October 1996, pp. 180–187.

[11] Thakur, R., Gropp, W., and Lusk, E., "On Implementing MPI-I/O Portably and with High Performance," Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, May 1999, pp. 23–32.

[12] Rosario, J., Bordawekar, R., and Choudhary, A., "Improved Parallel I/O via a Two-Phase Run-time Access Strategy," IPPS '93 Parallel I/O Workshop, February 1993.

[13] Thakur, R., Bordawekar, R., Choudhary, A., and Ponnusamy, R., "PASSION Runtime Library for Parallel I/O," Scalable Parallel Libraries Conference, October 1994.

[14] Thakur, R. and Choudhary, A., "An Extended Two-Phase Method for Accessing Sections of Out-of-Core Arrays," Scientific Programming, Vol. 5, No. 4, 1996, pp. 301–317.

[15] Thakur, R., Gropp, W., and Lusk, E., "Data Sieving and Collective I/O in ROMIO," Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation, February 1999, pp. 182–189.

[16] "HDF4 Home Page," The National Center for Supercomputing Applications, http://hdf.ncsa.uiuc.edu/hdf4.html.

[17] "HDF5 Home Page," The National Center for Supercomputing Applications, http://hdf.ncsa.uiuc.edu/HDF5/.

[18] Li, J., Liao, W., Choudhary, A., and Taylor, V., "I/O Analysis and Optimization for an AMR Cosmology Application," Proceedings of IEEE Cluster 2002, Chicago, IL, September 2002.

[19] Ross, R., Nurmi, D., Cheng, A., and Zingale, M., "A Case Study in Application I/O on Linux Clusters," Proceedings of SC2001, Denver, CO, November 2001.

[20] Li, J., Liao, W.-k., Choudhary, A., Ross, R., Thakur, R., Gropp, W., Latham, R., and Siegel, A., "Parallel netCDF: A High-Performance Scientific I/O Interface," Proceedings of the Supercomputing 2003 Conference, Phoenix, AZ, November 2003.

[21] Rew, R. and Davis, G., "The Unidata netCDF: Software for Scientific Data Access," Sixth International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography and Hydrology, Anaheim, CA, February 1990.

[22] "IOR HPC Benchmark," SourceForge, http://sourceforge.net/projects/ior-sio/.


    APPENDIX

Figure 12: Example CGNS file with the zone being the node containing grid and solution information. The root node holds CGNSLibraryVersion_t and CGNSBase_t; beneath the base are the zones (Zone_t) together with ConvergenceHistory_t and ReferenceState_t, and each zone contains GridCoordinates_t (CoordinateX and CoordinateY as DataArray_t), Elements_t (ElementConnectivity), FlowSolution_t (GridLocation_t, with Density and Pressure as DataArray_t), ZoneBC_t and ZoneGridConnectivity_t.

Table I: API of the pCGNS library as implemented in pcgnslib.c and declared in pcgnslib.h. Software using the library to access CGNS files in parallel must use the routines listed here.

General File Operations
  cgp_open                 Open a new file in parallel
  cgp_base_read            Read the details of a base in the file
  cgp_base_write           Write a new base to the file
  cgp_nbases               Return the number of bases in the file
  cgp_zone_read            Read the details of a zone in the base
  cgp_zone_type            Read the type of a zone in the base
  cgp_zone_write           Write a zone to the base
  cgp_nzones               Return the number of zones in the base

Coordinate Data Operations
  cgp_coord_write          Create the node and empty array to store coordinate data
  cgp_coord_write_data     Write coordinate data to the zone in parallel

Unstructured Grid Connectivity Operations
  cgp_section_write        Create the nodes and empty array to store grid connectivity for an unstructured mesh
  cgp_section_write_data   Write the grid connectivity to the zone in parallel for an unstructured mesh

Solution Data Operations
  cgp_sol_write            Create the node and empty array to store solution data
  cgp_sol_write_data       Write solution data to the zone in parallel

General Array Operations
  cgp_array_write          Create the node and empty array to store general array data
  cgp_array_write_data     Write general array data to the zone in parallel

Queued I/O Operations
  queue_slice_write        Queue a write operation to be executed later
  queue_flush              Execute queued write operations

Figure 13: Topology of the clusters' network-attached storage at Utah State University. The older cluster Uinta is connected to the root switch through either one or two gigabit switches, whereas all nodes in the new cluster Wasatch are connected directly to the root switch. All the storage is connected directly to the root switch, including a parallel storage solution from Panasas mounted on /panfs.
