Using IOR to Analyze I/O Performance
Hongzhang Shan, John Shalf
NERSC
Motivation
• HPC community has started to build petaflop platforms
• System:
  – I/O scalability: handling exponentially increasing concurrency; does I/O scale proportionally to flops?
  – Programming interface: how to make programming of increasingly complex file systems easily accessible to users
• Application:
  – Workload survey/characterization (what applications dominate our workload)
  – Understanding the I/O requirements of key applications
  – Develop or adopt microbenchmarks that reflect those requirements
  – Set performance expectations (now) and targets (future)
Outline
• Analyzing the NERSC workload
• Selecting a benchmark that reflects workload requirements (e.g., why IOR?)
• Using IOR to assess system performance
• Using IOR to predict I/O performance for full applications
Identify Application Requirements
• Identify users with demanding I/O requirements
  – Study NERSC allocations (ERCAP)
  – Study NERSC user surveys
• Approached a sampling of top I/O users
  – Astrophysics (Cactus, FLASH, CMB/MadCAP)
  – Materials
  – AMR framework (Chombo), etc.
Survey Results
• Access pattern:
  – Sequential I/O patterns dominate
  – Writes dominate (exception: out-of-core CMB)
• Size of I/O transaction:
  – Broad range: 1KB - tens of MB
• Typical strategies for I/O:
  – Run all I/O through one processor (serial)
  – One file per processor (multi-file parallel I/O)
  – MPI-IO to a single file (single-file parallel I/O)
  – pHDF5 and parallelNetCDF (advanced self-describing, platform-neutral file formats)
Potential Problems
• Run all I/O through one processor
  – Potential performance bottleneck
  – Does not fit distributed-memory systems
• One file per processor
  – High overhead for metadata management
    • A recent FLASH run on BG/L generated 75 million files
  – Bad for archival storage (lots of small files)
  – Bad for metadata servers (lots of file creates)
• Need to use shared files or a new interface
Migration to Parallel I/O
• Parallel I/O to a single file is slowly emerging
  – Used to imply MPI-IO for correctness, but concurrent POSIX also works (now)
  – Motivated by the need for fewer files
  – Simplifies data analysis and visualization
  – Simplifies archival storage
• Modest migration to high-level file formats (pHDF5, parallelNetCDF)
  – Motivated by portability & provenance concerns
  – Concerns about the overhead of advanced file formats
Benchmark Requirements
• Need to develop or adopt a benchmark that reflects application requirements:
  – Access pattern
  – File type
  – Programming interface
  – File size
  – Transaction size
  – Concurrency
Synthetic Benchmarks
• Most synthetic benchmarks cannot be related to observed application I/O patterns
  – IOzone, Bonnie, Self-Scaling Benchmark, SDSC I/O benchmark, Effective I/O Bandwidth, IOR, etc.
• Deficiencies:
  – Access patterns not realistic for HPC
  – Limited programming interfaces
  – Serial only
LLNL IOR Benchmark
• Developed by LLNL, used for the Purple procurement
• Focuses on parallel/sequential read/write operations that are typical in scientific applications
• Can exercise one-file-per-processor or shared-file accesses for a common set of testing parameters (differential study)
• Exercises an array of modern file APIs such as MPI-IO, POSIX (shared or unshared), pHDF5, parallelNetCDF
• Parameterized parallel file access patterns to mimic different application situations
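The parameter set above maps directly onto IOR's command line. A minimal sketch of assembling an invocation, assuming the flag spellings used by common IOR releases (they may differ across versions, and the `mpirun`/`ior` paths are placeholders):

```python
# Sketch: building an IOR invocation from the parameters this slide names
# (API, blockSize, transferSize, segments, file type, concurrency).
# Flag spellings follow common IOR releases and may vary by version.
def ior_command(api="MPIIO", block="64m", transfer="16m",
                segments=1, file_per_proc=False, nprocs=64):
    cmd = ["mpirun", "-np", str(nprocs), "ior",
           "-a", api,            # programming interface (POSIX, MPIIO, HDF5, ...)
           "-b", block,          # blockSize: contiguous data per process per segment
           "-t", transfer,       # transferSize: bytes moved per I/O call
           "-s", str(segments),  # segmentCount
           "-w", "-r"]           # measure both write and read
    if file_per_proc:
        cmd.append("-F")         # one file per processor instead of a shared file
    return " ".join(cmd)

print(ior_command(api="POSIX", file_per_proc=True, nprocs=8))
```

Varying only one parameter at a time (as in the studies below) makes the resulting bandwidth differences attributable to that parameter.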
IOR Design (shared file)
[Diagram: shared-file layout. The file consists of segments (one per time step or field); each segment contains one blockSize region per process (data for P0 … Pn). In distributed memory, each process writes its blockSize region as a series of transferSize-sized operations.]

• Important parameters:
  – blockSize
  – transferSize
  – API
  – Concurrency
  – fileType
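The shared-file layout in the diagram can be sketched as a small offset calculator. This is an illustrative model of the access pattern, not IOR's actual code:

```python
# Illustrative model of IOR's shared-file layout: each segment holds one
# blockSize region per process, and each region is written transferSize
# bytes at a time.
def offsets(rank, nprocs, block_size, transfer_size, segments):
    """Yield the file offset of every transfer issued by `rank`."""
    for seg in range(segments):
        base = seg * nprocs * block_size + rank * block_size
        for off in range(0, block_size, transfer_size):
            yield base + off

# With 2 procs, 4-byte blocks, 2-byte transfers, and 2 segments,
# rank 1 touches offsets 4, 6 (segment 0) and 12, 14 (segment 1).
print(list(offsets(1, 2, 4, 2, 2)))  # → [4, 6, 12, 14]
```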
Note: the blockSize regions correspond to datasets in HDF5 and NetCDF nomenclature.
IOR Design (One file per processor)
[Diagram: one-file-per-processor layout. Each process Pi writes to its own file; each file consists of segments, each a blockSize region written as a series of transferSize-sized operations.]
Outline
• Why IOR ?
• Using IOR to study system performance
• Using IOR to predict I/O performance for application
Platforms
Machine   Parallel File System   Proc Arch   Interconnect   Peak I/O BW              Max Node BW to I/O
Jaguar    Lustre                 Opteron     SeaStar        18 * 2.3GB/s ≈ 42GB/s    3.2GB/s (1.2GB/s)
Bassi     GPFS                   Power5      Federation     6 * 1GB/s ≈ 6.0GB/s      4.0GB/s (1.6GB/s)

• 18 DDN 9550 couplets on Jaguar, each delivering 2.3 - 3 GB/s
• Bassi has 6 VSDs with 8 non-redundant FC2 channels per VSD to achieve ~1GB/s per VSD (2x redundancy of FC)
• Effective unidirectional bandwidth shown in parentheses
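A back-of-envelope check from the table's numbers shows how few nodes are needed, in principle, to saturate each system's peak I/O bandwidth, using the effective per-node rates in parentheses (real contention effects will raise these counts):

```python
# Back-of-envelope: nodes needed to saturate peak I/O bandwidth, from the
# table above (effective unidirectional per-node rates in parentheses).
import math

def nodes_to_saturate(peak_gbs, per_node_gbs):
    return math.ceil(peak_gbs / per_node_gbs)

print(nodes_to_saturate(18 * 2.3, 1.2))  # Jaguar: 41.4 / 1.2 → 35
print(nodes_to_saturate(6 * 1.0, 1.6))   # Bassi:  6.0 / 1.6 → 4
```

This is consistent with the later observation that I/O performance peaks at relatively low concurrency.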
Caching Effects
Machine   Mem per Node   Node Size   Mem/Proc
Jaguar    8GB            2           4GB
Bassi     32GB           8           4GB
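Memory per process (memory per node divided by processors per node) is an upper bound on how much of each process's file the OS page cache could hold, so file sizes well beyond it are safely cache-free. A sketch of that bound; note the smaller 256MB/proc threshold observed on Bassi below is empirical, not derived from it:

```python
# Sketch: memory per process bounds what the page cache could hold of a
# per-process benchmark file; files much larger than this cannot be
# served from cache.
def mem_per_proc_gb(mem_per_node_gb, procs_per_node):
    return mem_per_node_gb / procs_per_node

def safely_uncached(file_gb_per_proc, mem_per_node_gb, procs_per_node):
    return file_gb_per_proc > mem_per_proc_gb(mem_per_node_gb, procs_per_node)

print(mem_per_proc_gb(32, 8))     # Bassi: 4.0 GB/proc
print(safely_uncached(8, 32, 8))  # an 8 GB/proc file cannot be fully cached
```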
File Size Effect on Jaguar

[Chart: write and read bandwidth (MB/s, 0-500) vs. file size per processor (16MB-8GB)]
File Size Effect on Bassi

[Chart: write and read bandwidth (MB/s, log scale 100-100,000) vs. file size per processor (16MB-8GB)]
Caching Effect
• On Bassi, the file size should be at least 256MB per processor to avoid caching effects
• On Jaguar, we have not observed caching effects; output is stable at around 2GB/s
Transfer Size (P = 8)
• A large transfer size is critical to achieving performance on Jaguar
• The effect on Bassi is not as significant
[Chart: write and read bandwidth (MB/s, 0-4000) vs. transfer size (1KB-1GB) for Bassi and Jaguar; annotations mark "DSL speed" at small transfers and "HPC speed" at large ones]
Scaling (No. of Processors)
• I/O performance peaks at:
  – P = 256 on Jaguar (lstripe = 144)
  – Close to its peak at P = 64 on Bassi
• Peak I/O performance can often be achieved at relatively low concurrency
I/O Scaling on Bassi
[Chart: write and read bandwidth (MB/s, 0-7000) vs. number of processors (8-256), with the system peak shown]
I/O Scaling on Jaguar
[Chart: write and read bandwidth (MB/s, 0-45,000) vs. number of processors (8-1024), with the system peak shown]
Shared vs. One file Per Proc
• Performance with a shared file is very close to that with one file per processor
• A shared file performs even better on Jaguar due to lower metadata overhead
Bassi
[Chart: individual-file vs. shared-file write and read bandwidth (MB/s, 0-7000) vs. number of processors (8-256)]
Jaguar
[Chart: individual-file vs. shared-file write and read bandwidth (MB/s, 0-45,000) vs. number of processors (8-1024)]
Programming Interface
• MPI-IO is close to POSIX performance
• Concurrent POSIX access to a single file works correctly
  – MPI-IO used to be required for correctness, but is no longer
• HDF5 (v1.6.5) falls a little behind, but tracks MPI-IO performance
• parallelNetCDF (v1.0.2pre) performs worst, and still has a 4GB dataset size limitation (due to limits on per-dimension sizes in the latest version)
Bassi
[Chart: POSIX, MPI-IO, HDF5, and NetCDF bandwidth (MB/s, 0-6000) vs. number of processors (0-300)]
Programming Interface
• POSIX, MPI-IO, HDF5 (v1.6.5) offer very similar scalable performance
• parallelNetCDF (v1.0.2.pre): flat performance
Jaguar
[Chart: POSIX, MPI-IO, HDF5, and NetCDF bandwidth (MB/s, 0-30,000) vs. number of processors (8-1024)]
Outline
• Why IOR ?
• Using IOR to study system performance
• Using IOR to predict I/O performance for application
Madbench
• Astrophysics application, used to analyze the massive Cosmic Microwave Background datasets
• Important parameters related to I/O:
  – Pixels: matrix size = pixels * pixels
  – Bins: number of matrices
• I/O behavior:
  – Out-of-core application
  – Matrix write/read
• Weak scaling problem
  – Pixels/Proc = 25K/16
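The matrix size above translates into I/O volume as follows; the 8-byte element size (double precision) is an assumption, since the slide gives only matrix size = pixels * pixels:

```python
# Sketch: bytes moved per MadBench matrix write/read, assuming
# double-precision (8-byte) elements; the slide specifies only that
# matrix size = pixels * pixels.
def matrix_bytes(pixels, elem_bytes=8):
    return pixels * pixels * elem_bytes

print(matrix_bytes(25_000) / 2**30)  # a 25K-pixel matrix is ≈ 4.66 GiB
```

At these sizes the matrices cannot stay resident, which is why the application is out-of-core.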
I/O Performance Prediction for Madbench
• IOR parameters: TransferSize=16MB, blockSize=64MB, segmentCount=1, P=64
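A quick arithmetic check of the data volumes these IOR settings imply:

```python
# Data volume implied by the IOR settings used to mimic MadBench:
# transferSize=16MB, blockSize=64MB, segmentCount=1, P=64.
MB = 1 << 20
transfer, block, segments, nprocs = 16 * MB, 64 * MB, 1, 64

per_proc = block * segments        # bytes each process moves
total = per_proc * nprocs          # shared-file size (or sum of per-proc files)
print(per_proc // MB, total // MB)  # → 64 4096  (64 MB/proc, 4 GB total)
```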
Madbench vs. IOR
[Chart: % prediction error (-100% to +100%) for read and write with individual and shared files, on Bassi and Jaguar; positive values indicate overprediction, negative values underprediction]
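The error metric plotted above can be read as the signed relative difference between the IOR prediction and the measured application rate; a hedged sketch of that reading:

```python
# Sketch of the prediction-error metric implied by the chart: signed
# relative difference between IOR's predicted rate and the application's
# measured rate (positive = overprediction, negative = underprediction).
def prediction_error(predicted_mbs, measured_mbs):
    return (predicted_mbs - measured_mbs) / measured_mbs * 100.0

print(prediction_error(1200.0, 1000.0))  # → 20.0  (IOR overpredicts by 20%)
print(prediction_error(800.0, 1000.0))   # → -20.0 (IOR underpredicts by 20%)
```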
Summary
• Surveyed the I/O requirements of NERSC applications and selected IOR as the synthetic benchmark for studying I/O performance
• I/O performance:
  – Highly affected by file size, I/O transaction size, and concurrency
  – Peaks at relatively low concurrency
  – The overhead of HDF5 and MPI-IO is low, but that of pNetCDF is high
• IOR can be used effectively to predict I/O performance for some applications
Extra Material
Chombo
• Chombo is a tool package for solving PDE problems on block-structured, adaptively refined regular grids
• I/O is used to read/write the hierarchical grid structure at the end of each time step
• Test problem: grid size = 400 * 400, 1 time step
Chombo I/O Behavior
• HDF5 interface
• Block size varies substantially, from 1KB - 10MB

[Diagram: boxes held out of order across processors P0-P2 in memory (e.g., 3 4 7 | 1 5 6 | 2 9 8) are written in sequential order (1-9) in the file]
Distribution of Box Data Size
[Chart: percentage of boxes (0-30%) vs. box data size (1KB to ~32MB)]
I/O Performance Prediction for Chombo