Project 4: Parallel netCDF
Enabling High Performance Application I/O

A. Choudhary, W. Liao (Northwestern University)
W. Gropp, R. Ross, R. Thakur (Argonne National Lab)

SciDAC All Hands Meeting, September 11-13, 2002
Outline
• NetCDF overview
• Parallel netCDF and MPI-IO
• Progress on API implementation
• Preliminary performance evaluation using LBNL test suite
NetCDF Overview
• NetCDF (network Common Data Form) is an API for reading/writing multi-dimensional data arrays
• Self-describing file format
  – A netCDF file includes information about the data it contains
• Machine independent
  – Portable file format
• Popular in both the fusion and climate communities

Example netCDF dataset (CDL notation):

netcdf example {
dimensions:            // dimension names and lengths
  lat = 5, lon = 10, level = 4, time = unlimited;
variables:             // variable types, names, shapes, attributes
  float temp(time,level,lat,lon);
    temp:long_name = "temperature";
    temp:units = "celsius";
  float rh(time,lat,lon);
    rh:long_name = "relative humidity";
    rh:valid_range = 0.0, 1.0;   // min and max
  int lat(lat), lon(lon), level(level), time(time);
    lat:units = "degrees_north";
    lon:units = "degrees_east";
    level:units = "millibars";
    time:units = "hours since 1996-1-1";
  // global attributes:
  :source = "Fictional Model Output";
data:                  // optional data assignments
  level = 1000, 850, 700, 500;
  lat   = 20, 30, 40, 50, 60;
  lon   = -160,-140,-118,-96,-84,-52,-45,-35,-25,-15;
  time  = 12;
  rh    = .5,.2,.4,.2,.3,.2,.4,.5,.6,.7,
          .1,.3,.1,.1,.1,.1,.5,.7,.8,.8,
          .1,.2,.2,.2,.2,.5,.7,.8,.9,.9,
          .1,.2,.3,.3,.3,.3,.7,.8,.9,.9,
           0,.1,.2,.4,.4,.4,.4,.7,.9,.9;   // 1 record allocated
}
NetCDF File Format
• File header
  – Stores metadata for fixed-size arrays:
    • number of arrays, dimension lists, global attribute list, etc.
• Array data
  – Fixed-size arrays
    • Stored contiguously in the file
  – Variable-size arrays
    • Records from all variable-sized arrays are stored interleaved
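For the example dataset above, the classic netCDF file layout looks roughly like this (a sketch, not byte-accurate):

  header: dimensions, attributes, and metadata for every variable
  lat, lon, level: fixed-size arrays, each stored contiguously, one after another
  record 1: temp(0,:,:,:), rh(0,:,:), time(0)
  record 2: temp(1,:,:,:), rh(1,:,:), time(1)
  ...: records of all variables defined over the unlimited dimension are interleaved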
NetCDF APIs
• Dataset APIs
  – Create/open/close a dataset, set the dataset to define/data mode, and synchronize dataset changes to disk
• Define mode APIs
  – Define the dataset: add dimensions and variables
• Attribute APIs
  – Add, change, and read attributes of datasets
• Inquiry APIs
  – Inquire dataset metadata: dim(id, name, len), var(name, ndims, shape, id)
• Data mode APIs
  – Read/write variables (access methods: single value, whole array, subarray, strided subarray, sampled subarray)
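A minimal sketch of these API groups in the serial netCDF C interface, writing one record of the rh variable from the example dataset (error checking omitted; the file name is illustrative):

#include <string.h>
#include <netcdf.h>

int main(void) {
    int ncid, dim_time, dim_lat, dim_lon, varid, dims[3];
    size_t start[3] = {0, 0, 0}, count[3] = {1, 5, 10};
    float rh[5][10] = {{0.5f}};

    /* Dataset API: create the file (starts in define mode) */
    nc_create("example.nc", NC_CLOBBER, &ncid);

    /* Define mode APIs: add dimensions and a variable */
    nc_def_dim(ncid, "time", NC_UNLIMITED, &dim_time);
    nc_def_dim(ncid, "lat", 5, &dim_lat);
    nc_def_dim(ncid, "lon", 10, &dim_lon);
    dims[0] = dim_time; dims[1] = dim_lat; dims[2] = dim_lon;
    nc_def_var(ncid, "rh", NC_FLOAT, 3, dims, &varid);

    /* Attribute API: attach an attribute to the variable */
    nc_put_att_text(ncid, varid, "long_name", strlen("relative humidity"), "relative humidity");

    /* Data mode APIs: leave define mode and write a subarray (one record) */
    nc_enddef(ncid);
    nc_put_vara_float(ncid, varid, start, count, &rh[0][0]);

    /* Dataset API: close, flushing changes to disk */
    nc_close(ncid);
    return 0;
}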
Serial vs. Parallel netCDF
• Serial netCDF
  – Parallel read
    • Implemented by simply having all processors read the file independently
    • Does NOT utilize the native I/O provided by the parallel file system, so it misses parallel optimizations
  – Sequential write
    • Parallel writes are carried out by shipping data to a single process, which can overwhelm its memory capacity
• Parallel netCDF
  – Parallel read/write to a shared netCDF file
  – Built on top of MPI-IO, which utilizes the optimal I/O facilities provided by the parallel file system
  – Can pass high-level access hints down to the file system for further optimization
[Figure: with serial netCDF, processes P0-P3 funnel data through a single netCDF instance to the parallel file system; with parallel netCDF, P0-P3 all access the parallel file system directly through the parallel netCDF library]
Design of the Parallel netCDF APIs
• Goals
  – Retain the original file format
    • Applications using the original netCDF library can access the same files
  – A new set of parallel APIs
    • Prefix names “ncmpi_” (C) and “nfmpi_” (Fortran)
  – Similar APIs
    • Minimal changes from the original APIs for easy migration (see the sketch below)
  – Portable across machines
  – High performance
    • Tune the APIs to provide better performance in today’s computing environments
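To illustrate the migration path, here is the earlier serial write recast with the ncmpi_ C interface. The signatures follow the released PnetCDF library and may differ in small details from the design shown on these slides:

#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv) {
    int ncid, dim_time, dim_lat, dim_lon, varid, dims[3], rank;
    float rh[5][10] = {{0.5f}};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* collective create: an MPI communicator and MPI_Info hints are the only new arguments */
    ncmpi_create(MPI_COMM_WORLD, "example.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);

    /* define mode looks just like serial netCDF (dimension lengths are MPI_Offset) */
    ncmpi_def_dim(ncid, "time", NC_UNLIMITED, &dim_time);
    ncmpi_def_dim(ncid, "lat", 5, &dim_lat);
    ncmpi_def_dim(ncid, "lon", 10, &dim_lon);
    dims[0] = dim_time; dims[1] = dim_lat; dims[2] = dim_lon;
    ncmpi_def_var(ncid, "rh", NC_FLOAT, 3, dims, &varid);
    ncmpi_enddef(ncid);

    /* each process writes its own record of the unlimited dimension, collectively */
    MPI_Offset start[3] = {rank, 0, 0}, count[3] = {1, 5, 10};
    ncmpi_put_vara_float_all(ncid, varid, start, count, &rh[0][0]);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}

The resulting file keeps the original netCDF format, so serial netCDF applications can still read it.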
Parallel File System
• Parallel file system consists of multiple I/O nodes– Increase bandwidth between
compute and I/O nodes
• Each I/O node may contain more than one disk– Increase bandwidth between disks
and I/O nodes
• A file is striped across all disks in a round-robin fashion– Maximize the possibility of parallel
access
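A tiny illustration of round-robin striping; the stripe size and disk count are assumptions made up for this sketch:

#include <stdio.h>

#define STRIPE_SIZE 65536   /* assumed stripe unit, in bytes */
#define NUM_DISKS   4       /* assumed number of disks       */

int main(void) {
    long long offset = 300000;                        /* file offset to locate         */
    long long stripe = offset / STRIPE_SIZE;          /* which stripe unit it falls in */
    int       disk   = (int)(stripe % NUM_DISKS);     /* round-robin disk choice       */
    long long local  = (stripe / NUM_DISKS) * STRIPE_SIZE + offset % STRIPE_SIZE;
    printf("offset %lld -> disk %d, local offset %lld\n", offset, disk, local);
    return 0;
}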
[Figure: compute nodes connected through a switch network to I/O servers; the file is striped round-robin across the I/O servers' disks]
Parallel netCDF and MPI-IO
• The parallel netCDF APIs are the interface between applications and parallel file systems
• Parallel netCDF is implemented on top of MPI-IO
• ROMIO is an implementation of the MPI-IO standard
• ROMIO is built on top of ADIO
• ADIO has implementations for various file systems, using optimal native I/O calls
[Figure: software stack on the compute nodes (parallel netCDF over ROMIO over ADIO in user space); ADIO reaches the I/O servers through the switch network in file system space]
Parallel API Implementations
• Dataset APIs
  – Collective calls
  – Add an MPI communicator to define the scope of the I/O processes
  – Add an MPI_Info object to pass access hints for further optimization (see the example below)
• Define mode APIs
  – Collective calls
• Attribute APIs
  – Collective calls
• Inquiry APIs
  – Collective calls
• Data mode APIs
  – Collective mode (default)
    • Ensures file consistency
  – Independent mode

File create/open:

  ncmpi_create/open(MPI_Comm comm,
                    const char *path,
                    int cmode,
                    MPI_Info info,
                    int *ncidp);

Switch in/out of independent data mode:

  ncmpi_begin_indep_data(int ncid);
  ncmpi_end_indep_data(int ncid);
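A sketch of passing access hints at create time; the hint keys shown (romio_cb_write, striping_factor) are standard ROMIO/MPI-IO hints and are used here purely for illustration:

  int ncid;
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "romio_cb_write", "enable");   /* turn on collective buffering for writes         */
  MPI_Info_set(info, "striping_factor", "16");      /* request 16 I/O servers, if the file system allows */
  ncmpi_create(MPI_COMM_WORLD, "example.nc", NC_CLOBBER, info, &ncid);
  MPI_Info_free(&info);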
Data Mode APIs
• Collective and independent calls
  – With or without the “_all” suffix
• High-level APIs
  – Mimic the original APIs
  – Easy path of migration to the parallel interface
  – Map netCDF access types to MPI derived datatypes
• Flexible APIs
  – Better handling of internal data representations
  – More fully expose the capabilities of MPI-IO to the programmer

High-level APIs:

  ncmpi_put/get_vars_<type>_all(int ncid,
                                const MPI_Offset start[ ],
                                const MPI_Offset count[ ],
                                const MPI_Offset stride[ ],
                                const unsigned char *buf);

Flexible APIs:

  ncmpi_put/get_vars(int ncid,
                     const MPI_Offset start[ ],
                     const MPI_Offset count[ ],
                     const MPI_Offset stride[ ],
                     void *buf,
                     int bufcount,
                     MPI_Datatype datatype);
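A sketch of the two styles for a strided collective read. The calls follow the released PnetCDF library, which also takes a variable ID; ncid and varid are assumed to come from an earlier open/inquiry:

  /* high-level: the library converts between the file representation and float */
  float buf[10][10];
  MPI_Offset start[3]  = {0, 0, 0};
  MPI_Offset count[3]  = {1, 10, 10};
  MPI_Offset stride[3] = {1, 2, 2};
  ncmpi_get_vars_float_all(ncid, varid, start, count, stride, &buf[0][0]);

  /* flexible: the in-memory layout is described with an MPI derived datatype */
  MPI_Datatype subarray;
  int sizes[2]    = {20, 20};
  int subsizes[2] = {10, 10};
  int starts[2]   = {0, 0};
  MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_FLOAT, &subarray);
  MPI_Type_commit(&subarray);
  float mem[20][20];
  ncmpi_get_vars_all(ncid, varid, start, count, stride, &mem[0][0], 1, subarray);
  MPI_Type_free(&subarray);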
LBNL Benchmark
• Test suite
  – Developed by Chris Ding et al. at LBNL
  – Written in Fortran
  – Simple block partition patterns
• Access to a 3D array which is stored in a single netCDF file
• Run on the IBM SP2 at NERSC, LBNL
  – Each compute node is an SMP with 16 processors
  – I/O is performed using all processors
[Figure: block partition patterns of the 3D array (axes X, Y, Z) among 8 processors: X, Y, Z, XY, XZ, YZ, and XYZ partitions]
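To make the partitions concrete, here is a sketch of the start/count computation a process might use for a one-dimensional partition of the 256^3 array (assumes the process count divides 256; ncid, varid, and local_buf are illustrative and come from earlier calls):

  int nprocs, rank;
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* each process owns a contiguous slab of 256/nprocs planes along the first dimension */
  MPI_Offset start[3] = {rank * (256 / nprocs), 0, 0};
  MPI_Offset count[3] = {256 / nprocs, 256, 256};
  ncmpi_get_vara_float_all(ncid, varid, start, count, local_buf);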
LBNL Results – 64 MB
• Array size: 256 x 256 x 256, real*4
• Read
  – In some cases, a performance improvement over a single processor is observed
  – An 8-processor parallel read is 2-3 times faster than serial netCDF
• Write
  – Performance is not better than serial netCDF; it is 7-8 times slower
Our Results – 64 MB
• Array size: 256 x 256 x 256, real*4
• Run on IBM SP2 at SDSC
• I/O is performed using one processor per node
[Figure: write and read bandwidth (MB/sec, log scale) vs. number of processors (1-16) for the 64 MB array, one curve per partition pattern]
LBNL Results – 1 GB
• Array size: 512 x 512 x 512, real*8
• Read
  – No better performance is observed
• Write
  – Writes with 4-8 processors achieve 2-3 times higher bandwidth than a single processor
Our Results – 1 GB
• Array size: 512 x 512 x 512, real*8
• Run on IBM SP2 at SDSC
• I/O is performed using one processor per node
[Figure: read and write bandwidth (MB/sec, log scale) vs. number of processors (1-32) for the 1 GB array, one curve per partition pattern]
Summary
• Complete the parallel C APIs
• Identify friendly users
  – ORNL, LBNL
• User reference manual
• Preliminary performance results
  – Using the LBNL test suite: typical access patterns
  – Obtained scalable results