Parallel netCDF: Enabling High Performance Application I/O
Project 4, SciDAC All Hands Meeting, September 11-13, 2002
A. Choudhary, W. Liao (Northwestern University); W. Gropp, R. Ross, R. Thakur (Argonne National Lab)




Page 1:

Project 4: SciDAC All Hands Meeting, September 11-13, 2002

A. Choudhary, W. Liao (Northwestern University)
W. Gropp, R. Ross, R. Thakur (Argonne National Lab)

Parallel netCDF
Enabling High Performance Application I/O

Page 2:

Outline

• NetCDF overview

• Parallel netCDF and MPI-IO

• Progress on API implementation

• Preliminary performance evaluation using LBNL test suite

Page 3:

NetCDF Overview

• NetCDF (network Common Data Form) is an API for reading/writing multi-dimensional data arrays
• Self-describing file format
  – A netCDF file includes information about the data it contains
• Machine independent
  – Portable file format
• Popular in both the fusion and climate communities

Example dataset in CDL notation:

netcdf example {      // CDL notation for a netCDF dataset
dimensions:           // dimension names and lengths
    lat = 5, lon = 10, level = 4, time = unlimited;
variables:            // var types, names, shapes, attributes
    float temp(time, level, lat, lon);
        temp:long_name = "temperature";
        temp:units = "celsius";
    float rh(time, lat, lon);
        rh:long_name = "relative humidity";
        rh:valid_range = 0.0, 1.0;   // min and max
    int lat(lat), lon(lon), level(level), time(time);
        lat:units = "degrees_north";
        lon:units = "degrees_east";
        level:units = "millibars";
        time:units = "hours since 1996-1-1";
    // global attributes:
        :source = "Fictional Model Output";
data:                 // optional data assignments
    level = 1000, 850, 700, 500;
    lat = 20, 30, 40, 50, 60;
    lon = -160,-140,-118,-96,-84,-52,-45,-35,-25,-15;
    time = 12;
    rh = .5,.2,.4,.2,.3,.2,.4,.5,.6,.7,
         .1,.3,.1,.1,.1,.1,.5,.7,.8,.8,
         .1,.2,.2,.2,.2,.5,.7,.8,.9,.9,
         .1,.2,.3,.3,.3,.3,.7,.8,.9,.9,
          0,.1,.2,.4,.4,.4,.4,.7,.9,.9;   // 1 record allocated
}
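
The CDL above only describes the dataset; the sketch below (an assumption, not taken from the slides) reads it back with the serial netCDF C API. The file name "example.nc" and the choice of the rh variable are made up for illustration.

/* Minimal sketch (assumed, not from the slides): read the example dataset
 * back with the serial netCDF C API. */
#include <stdio.h>
#include <netcdf.h>

int main(void)
{
    int ncid, varid, ndims;
    float rh[1][5][10];                 /* 1 time record x lat(5) x lon(10) */

    if (nc_open("example.nc", NC_NOWRITE, &ncid) != NC_NOERR)
        return 1;

    /* Self-describing: the file itself tells us about "rh". */
    nc_inq_varid(ncid, "rh", &varid);
    nc_inq_varndims(ncid, varid, &ndims);
    printf("rh has %d dimensions\n", ndims);

    nc_get_var_float(ncid, varid, &rh[0][0][0]);
    printf("rh[0][0][0] = %g\n", rh[0][0][0]);

    nc_close(ncid);
    return 0;
}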

Page 4:

NetCDF File Format

• File header
  – Stores metadata for fixed-size arrays: number of arrays, dimension lists, global attribute list, etc.
• Array data
  – Fixed-size arrays
    • Stored contiguously in the file
  – Variable-size arrays
    • Records from all variable-size arrays are stored interleaved

Page 5:

NetCDF APIs

• Dataset APIs
  – Create/open/close a dataset, set the dataset to define/data mode, and synchronize dataset changes to disk
• Define mode APIs
  – Define the dataset: add dimensions and variables
• Attribute APIs
  – Add, change, and read attributes of datasets
• Inquiry APIs
  – Inquire dataset metadata: dim(id, name, len), var(name, ndims, shape, id)
• Data mode APIs
  – Read/write variables (access methods: single value, whole array, subarray, strided subarray, sampled subarray)
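
To illustrate how these API groups fit together, here is a minimal serial-netCDF sketch (an assumption, not taken from the slides); the file name, dimension names, and variable name are made up.

/* A minimal sketch of the serial netCDF call sequence behind the API
 * groups above: dataset, define-mode, attribute, and data-mode calls. */
#include <netcdf.h>

int write_simple(void)
{
    int ncid, dim_x, dim_y, varid;
    int dims[2];
    float data[4][6] = {{0.0f}};

    nc_create("simple.nc", NC_CLOBBER, &ncid);            /* dataset API   */
    nc_def_dim(ncid, "x", 4, &dim_x);                     /* define mode   */
    nc_def_dim(ncid, "y", 6, &dim_y);
    dims[0] = dim_x;
    dims[1] = dim_y;
    nc_def_var(ncid, "field", NC_FLOAT, 2, dims, &varid);
    nc_put_att_text(ncid, varid, "units", 7, "celsius");  /* attribute API */
    nc_enddef(ncid);                                      /* leave define mode */
    nc_put_var_float(ncid, varid, &data[0][0]);           /* data mode API */
    nc_close(ncid);                                       /* dataset API   */
    return 0;
}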

Page 6:

Serial vs. Parallel netCDF

• Serial netCDF
  – Parallel read
    • Implemented by simply having all processors read the file independently
    • Does NOT use the native I/O provided by the parallel file system, so parallel optimizations are missed
  – Sequential write
    • Parallel writes are carried out by shipping data to a single process, which can overwhelm that process's memory capacity
• Parallel netCDF
  – Parallel read/write to a shared netCDF file
  – Built on top of MPI-IO, which uses the optimal I/O facilities provided by the parallel file system
  – Can pass high-level access hints down to the file system for further optimization

[Figure: with serial netCDF, processes P0-P3 funnel their data through a single netCDF instance to the parallel file system; with parallel netCDF, P0-P3 all access the parallel file system through the parallel netCDF library.]

Page 7:

Design of the Parallel netCDF APIs

• Goals
  – Retain the original file format
    • Applications using the original netCDF can access the same files
  – A new set of parallel APIs
    • Prefixed "ncmpi_" (C) and "nfmpi_" (Fortran)
  – Similar APIs
    • Minimal changes from the original APIs for easy migration
  – Portable across machines
  – High performance
    • Tune the APIs to provide better performance in today's computing environments
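
To illustrate the "minimal changes" goal, the sketch below contrasts a serial open with its parallel counterpart; it assumes the ncmpi_ prefix and argument order of the released PnetCDF C API and is not a quote from the slides.

/* A hedged migration sketch, assuming the released PnetCDF C API. */
#include <mpi.h>
#include <pnetcdf.h>

void open_parallel(const char *path)
{
    int ncid;

    /* Serial netCDF:   nc_open(path, NC_NOWRITE, &ncid);
     * Parallel netCDF: the same call shape, plus a communicator that
     * defines the I/O process scope and an MPI_Info handle for hints. */
    ncmpi_open(MPI_COMM_WORLD, path, NC_NOWRITE, MPI_INFO_NULL, &ncid);

    /* ... inquiry and data-mode calls, unchanged in spirit ... */

    ncmpi_close(ncid);
}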

Page 8:

Parallel File System

• A parallel file system consists of multiple I/O nodes
  – Increase bandwidth between compute and I/O nodes
• Each I/O node may contain more than one disk
  – Increase bandwidth between disks and I/O nodes
• A file is striped across all disks in a round-robin fashion
  – Maximize the possibility of parallel access

[Figure: compute nodes connected through a switch network to multiple I/O servers, each with several disks; a file is striped round-robin across the disks.]
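
The short sketch below is not from the slides; it simply illustrates round-robin striping. The stripe size and server count are arbitrary assumptions.

/* Illustration only: which I/O server holds a given byte offset under
 * round-robin striping. */
#include <stdio.h>

#define STRIPE_SIZE (64 * 1024)   /* bytes per stripe unit (assumed) */
#define NUM_SERVERS 4             /* number of I/O servers (assumed) */

static int server_of(long long offset)
{
    return (int)((offset / STRIPE_SIZE) % NUM_SERVERS);
}

int main(void)
{
    long long off;

    /* Consecutive stripe units land on consecutive servers, so a large
     * contiguous access touches every server in parallel. */
    for (off = 0; off < 8LL * STRIPE_SIZE; off += STRIPE_SIZE)
        printf("offset %lld -> server %d\n", off, server_of(off));
    return 0;
}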

Page 9:

Parallel netCDF and MPI-IO

• The parallel netCDF APIs are the applications' interface to the parallel file system
• Parallel netCDF is implemented on top of MPI-IO
• ROMIO is an implementation of the MPI-IO standard
• ROMIO is built on top of ADIO
• ADIO has implementations for various file systems, using the optimal native I/O calls

[Figure: compute nodes run the user-space stack of parallel netCDF over ROMIO over ADIO, and access the I/O servers of the parallel file system through the switch network.]
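
For context, here is a minimal MPI-IO sketch (not from the slides) of the kind of collective file access that parallel netCDF issues underneath; the file name and block layout are made up for illustration.

/* Each rank writes its own block of a shared file with a collective call. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int i, rank, buf[1024];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 1024; i++)
        buf[i] = rank;
    offset = (MPI_Offset)rank * sizeof(buf);

    /* ROMIO may apply collective-I/O optimizations (e.g. two-phase I/O)
     * to this call; ADIO maps it onto the native file-system interface. */
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, offset, buf, 1024, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}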

Page 10:

Parallel API Implementations

• Dataset APIs
  – Collective calls
  – Add an MPI communicator to define the I/O process scope
  – Add MPI_Info to pass access hints for further optimization
• Define mode APIs
  – Collective calls
• Attribute APIs
  – Collective calls
• Inquiry APIs
  – Collective calls
• Data mode APIs
  – Collective mode (default)
    • Ensures file consistency
  – Independent mode

File open/create:

ncmpi_create/open(MPI_Comm comm,
                  const char *path,
                  int cmode,
                  MPI_Info info,
                  int *ncidp);

Switch in/out of independent data mode:

ncmpi_begin_indep_data(int ncid);
ncmpi_end_indep_data(int ncid);
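
A hedged sketch of passing hints at create time is shown below, following the ncmpi_create signature above. The ROMIO hint names used ("cb_buffer_size", "striping_factor") are only examples; whether a hint takes effect depends on the underlying file system.

/* Pass access hints to the file system through MPI_Info at create time. */
#include <mpi.h>
#include <pnetcdf.h>

int create_with_hints(const char *path, int *ncidp)
{
    MPI_Info info;
    int err;

    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "8388608");   /* collective buffer size */
    MPI_Info_set(info, "striping_factor", "4");        /* request 4 stripes      */

    /* Collective call: every process in the communicator participates. */
    err = ncmpi_create(MPI_COMM_WORLD, path, NC_CLOBBER, info, ncidp);

    MPI_Info_free(&info);
    return err;
}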

Page 11:

Data Mode APIs

• Collective and independent calls
  – Collective calls carry the "_all" suffix; independent calls do not
• High-level APIs
  – Mimic the original APIs
  – Easy path of migration to the parallel interface
  – Map netCDF access types to MPI derived datatypes
• Flexible APIs
  – Better handling of internal data representations
  – More fully expose the capabilities of MPI-IO to the programmer

High-level APIs:

ncmpi_put/get_vars_<type>_all(
    int ncid,
    const MPI_Offset start[ ],
    const MPI_Offset count[ ],
    const MPI_Offset stride[ ],
    const unsigned char *buf);

Flexible APIs:

ncmpi_put/get_vars(
    int ncid,
    const MPI_Offset start[ ],
    const MPI_Offset count[ ],
    const MPI_Offset stride[ ],
    void *buf,
    int bufcount,
    MPI_Datatype datatype);
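
As a usage illustration (not from the slides), the sketch below writes a block of rows collectively with a high-level call; it follows the released PnetCDF interface, which also takes a variable ID argument, and the row-block decomposition is made up.

/* Each rank writes its own block of rows of a global nx-by-ny array. */
#include <mpi.h>
#include <pnetcdf.h>

void write_rows(int ncid, int varid, int rank, int nprocs,
                MPI_Offset nx, MPI_Offset ny, const float *local)
{
    MPI_Offset start[2], count[2];

    count[0] = nx / nprocs;          /* rows owned by this rank (assume it divides) */
    count[1] = ny;
    start[0] = rank * count[0];      /* offset of this rank's row block             */
    start[1] = 0;

    /* Collective ("_all") counterpart of ncmpi_put_vara_float. */
    ncmpi_put_vara_float_all(ncid, varid, start, count, local);
}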

Page 12:

LBNL Benchmark

• Test suite
  – Developed by Chris Ding et al. at LBNL
  – Written in Fortran
  – Simple block partition patterns
• Access to a 3D array which is stored in a single netCDF file
• Running on the IBM SP2 at NERSC, LBNL
  – Each compute node is an SMP with 16 processors
  – I/O is performed using all processors

[Figure: the block partition patterns (X, Y, Z, XY, XZ, YZ, XYZ) of the 3D array among 8 processors.]
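
As an illustration only (not the benchmark's actual code), the sketch below shows how a simple one-axis block partition of the 256^3 array translates into the start[]/count[] arguments used by the data-mode APIs; it assumes the process count divides the array extent.

/* Translate a one-axis block partition into start[]/count[]. */
#include <mpi.h>

#define N 256   /* array extent per dimension (the 64 MB real*4 case) */

/* Partition the 3D array along dimension `axis` (0, 1, or 2). */
void block_partition(int rank, int nprocs, int axis,
                     MPI_Offset start[3], MPI_Offset count[3])
{
    int d;

    for (d = 0; d < 3; d++) {
        start[d] = 0;
        count[d] = N;
    }
    count[axis] = N / nprocs;
    start[axis] = rank * count[axis];
}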

Page 13:

LBNL Results – 64 MB

• Array size: 256 x 256 x 256, real*4
• Read
  – In some cases performance improves over a single processor
  – An 8-processor parallel read is 2-3 times faster than serial netCDF
• Write
  – Performance is not better than serial netCDF; it is 7-8 times slower

Page 14:

Our Results – 64 MB

• Array size: 256 x 256 x 256, real*4

• Run on IBM SP2 at SDSC

• I/O is performed using one processor per node

[Charts: write and read bandwidth (MB/sec) for the 64 MB array vs. number of processors (1, 2, 4, 8, 16), one curve per partition pattern (X, Y, Z, YX, ZX, ZY, ZYX).]

Page 15:

LBNL Results – 1 GB

• Array size: 512 x 512 x 512, real*8
• Read
  – No performance improvement is observed
• Write
  – Writing with 4-8 processors gives 2-3 times higher bandwidth than a single processor

Page 16:

Our Results – 1 GB

• Array size: 512 x 512 x 512, real*8

• Run on IBM SP2 at SDSC

• I/O is performed using one processor per node

[Charts: read and write bandwidth (MB/sec) for the 1 GB array vs. number of processors (1, 2, 4, 8, 16, 32), one curve per partition pattern (X, Y, Z, YX, ZX, ZY, ZYX).]

Page 17:

Summary

• Complete the parallel C APIs
• Identify friendly users
  – ORNL, LBNL
• User reference manual
• Preliminary performance results
  – Using the LBNL test suite: typical access patterns
  – Obtained scalable results