Using IOR to Analyze I/O Performance
Hongzhang Shan, John Shalf
NERSC
Motivation
• HPC community has started to build petaflop platforms
• System:
  – I/O scalability: handling exponentially increasing concurrency; does I/O scale proportionally to flops?
  – Programming interface: how to make programming of increasingly complex file systems easily accessible to users
• Application:
  – Workload survey/characterization (what applications dominate our workload)
  – Understanding the I/O requirements of key applications
  – Develop or adopt microbenchmarks that reflect those requirements
  – Set performance expectations (now) and targets (future)
Outline
• Analyzing the NERSC workload
• Selecting a benchmark that reflects workload requirements (e.g., why IOR?)
• Using IOR to assess system performance
• Using IOR to predict I/O performance for full applications
Identify Application Requirements
• Identify users with demanding I/O requirements
  – Study NERSC allocations (ERCAP)
  – Study NERSC user surveys
• Approached a sampling of top I/O users
  – Astrophysics (Cactus, FLASH, CMB/MadCAP)
  – Materials
  – AMR framework (Chombo), etc.
Survey Results
• Access pattern:
  – Sequential I/O patterns dominate
  – Writes dominate (exception: out-of-core CMB)
• Size of I/O transaction:
  – Broad range: 1KB - tens of MB
• Typical strategies for I/O:
  – Run all I/O through one processor (serial)
  – One file per processor (multi-file parallel I/O)
  – MPI-IO to a single file (single-file parallel I/O)
  – pHDF5 and parallelNetCDF (advanced self-describing, platform-neutral file formats)
Potential Problems
• Run all I/O through one processor
  – Potential performance bottleneck
  – Does not fit distributed-memory systems
• One file per processor
  – High overhead for metadata management
    • A recent FLASH run on BG/L generated 75 million files
  – Bad for archival storage (lots of small files)
  – Bad for metadata servers (lots of file creates)
• Need to use shared files or a new interface
Migration to Parallel I/O
• Parallel I/O to a single file is slowly emerging
  – Used to imply MPI-IO for correctness, but concurrent POSIX also works (now)
  – Motivated by the need for fewer files
  – Simplifies data analysis and visualization
  – Simplifies archival storage
• Modest migration to high-level file formats (pHDF5, parallelNetCDF)
  – Motivated by portability & provenance concerns
  – Concerns about the overhead of advanced file formats
Benchmark Requirements
• Need to develop or adopt a benchmark that reflects application requirements:
  – Access pattern
  – File type
  – Programming interface
  – File size
  – Transaction size
  – Concurrency
Synthetic Benchmarks
• Most synthetic benchmarks cannot be related to observed application I/O patterns
  – IOzone, Bonnie, Self-Scaling Benchmark, SDSC I/O benchmark, Effective I/O Bandwidth, IOR, etc.
• Deficiencies:
  – Access patterns not realistic for HPC
  – Limited programming interfaces
  – Serial only
LLNL IOR Benchmark
• Developed by LLNL, used for the Purple procurement
• Focuses on parallel/sequential read/write operations that are typical in scientific applications
• Can exercise one-file-per-processor or shared-file accesses for a common set of testing parameters (differential study)
• Exercises an array of modern file APIs such as MPI-IO, POSIX (shared or unshared), pHDF5, parallelNetCDF
• Parameterized parallel file access patterns to mimic different application situations
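The parameter set above maps directly onto IOR's command line. A minimal sketch of assembling an invocation, assuming the flag spellings used by common IOR releases (they may differ across versions, and the `mpirun`/`ior` paths are placeholders):

```python
# Sketch: building an IOR invocation from the parameters this slide names
# (API, blockSize, transferSize, segments, file type, concurrency).
# Flag spellings follow common IOR releases and may vary by version.
def ior_command(api="MPIIO", block="64m", transfer="16m",
                segments=1, file_per_proc=False, nprocs=64):
    cmd = ["mpirun", "-np", str(nprocs), "ior",
           "-a", api,            # programming interface (POSIX, MPIIO, HDF5, ...)
           "-b", block,          # blockSize: contiguous data per process per segment
           "-t", transfer,       # transferSize: bytes moved per I/O call
           "-s", str(segments),  # segmentCount
           "-w", "-r"]           # measure both write and read
    if file_per_proc:
        cmd.append("-F")         # one file per processor instead of a shared file
    return " ".join(cmd)

print(ior_command(api="POSIX", file_per_proc=True, nprocs=8))
```

Varying only one parameter at a time (as in the studies below) makes the resulting bandwidth differences attributable to that parameter.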
IOR Design (shared file)
[Diagram: shared-file layout. The file consists of segments (one per time step or field); each segment contains one blockSize region per process (data for P0 … Pn). In distributed memory, each process writes its blockSize region as a series of transferSize-sized operations.]

• Important parameters:
  – blockSize
  – transferSize
  – API
  – Concurrency
  – fileType
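The shared-file layout in the diagram can be sketched as a small offset calculator. This is an illustrative model of the access pattern, not IOR's actual code:

```python
# Illustrative model of IOR's shared-file layout: each segment holds one
# blockSize region per process, and each region is written transferSize
# bytes at a time.
def offsets(rank, nprocs, block_size, transfer_size, segments):
    """Yield the file offset of every transfer issued by `rank`."""
    for seg in range(segments):
        base = seg * nprocs * block_size + rank * block_size
        for off in range(0, block_size, transfer_size):
            yield base + off

# With 2 procs, 4-byte blocks, 2-byte transfers, and 2 segments,
# rank 1 touches offsets 4, 6 (segment 0) and 12, 14 (segment 1).
print(list(offsets(1, 2, 4, 2, 2)))  # → [4, 6, 12, 14]
```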
Note: the blockSize regions correspond to datasets in HDF5 and NetCDF nomenclature.
IOR Design (One file per processor)
[Diagram: one-file-per-processor layout. Each process Pi writes to its own file; each file consists of segments, each a blockSize region written as a series of transferSize-sized operations.]
Outline
• Why IOR ?
• Using IOR to study system performance
• Using IOR to predict I/O performance for application
Platforms
Machine   Parallel File System   Proc Arch   Interconnect   Peak I/O BW              Max Node BW to I/O
Jaguar    Lustre                 Opteron     SeaStar        18 * 2.3GB/s ≈ 42GB/s    3.2GB/s (1.2GB/s)
Bassi     GPFS                   Power5      Federation     6 * 1GB/s ≈ 6.0GB/s      4.0GB/s (1.6GB/s)

• 18 DDN 9550 couplets on Jaguar, each delivering 2.3 - 3 GB/s
• Bassi has 6 VSDs with 8 non-redundant FC2 channels per VSD to achieve ~1GB/s per VSD (2x redundancy of FC)
• Effective unidirectional bandwidth shown in parentheses
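A back-of-envelope check from the table's numbers shows how few nodes are needed, in principle, to saturate each system's peak I/O bandwidth, using the effective per-node rates in parentheses (real contention effects will raise these counts):

```python
# Back-of-envelope: nodes needed to saturate peak I/O bandwidth, from the
# table above (effective unidirectional per-node rates in parentheses).
import math

def nodes_to_saturate(peak_gbs, per_node_gbs):
    return math.ceil(peak_gbs / per_node_gbs)

print(nodes_to_saturate(18 * 2.3, 1.2))  # Jaguar: 41.4 / 1.2 → 35
print(nodes_to_saturate(6 * 1.0, 1.6))   # Bassi:  6.0 / 1.6 → 4
```

This is consistent with the later observation that I/O performance peaks at relatively low concurrency.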
Caching Effects
Machine   Mem per Node   Node Size   Mem/Proc
Jaguar    8GB            2           4GB
Bassi     32GB           8           4GB
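Memory per process (memory per node divided by processors per node) is an upper bound on how much of each process's file the OS page cache could hold, so file sizes well beyond it are safely cache-free. A sketch of that bound; note the smaller 256MB/proc threshold observed on Bassi below is empirical, not derived from it:

```python
# Sketch: memory per process bounds what the page cache could hold of a
# per-process benchmark file; files much larger than this cannot be
# served from cache.
def mem_per_proc_gb(mem_per_node_gb, procs_per_node):
    return mem_per_node_gb / procs_per_node

def safely_uncached(file_gb_per_proc, mem_per_node_gb, procs_per_node):
    return file_gb_per_proc > mem_per_proc_gb(mem_per_node_gb, procs_per_node)

print(mem_per_proc_gb(32, 8))     # Bassi: 4.0 GB/proc
print(safely_uncached(8, 32, 8))  # an 8 GB/proc file cannot be fully cached
```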
File Size Effect on Jaguar

[Chart: write and read bandwidth (MB/s, 0-500) vs. file size per processor (16MB-8GB)]
File Size Effect on Bassi

[Chart: write and read bandwidth (MB/s, log scale 100-100,000) vs. file size per processor (16MB-8GB)]
Caching Effect
• On Bassi, the file size should be at least 256MB per processor to avoid caching effects
• On Jaguar, we have not observed caching effects; output is stable at around 2GB/s
Transfer Size (P = 8)
• A large transfer size is critical to achieving performance on Jaguar
• The effect on Bassi is not as significant
[Chart: write and read bandwidth (MB/s, 0-4000) vs. transfer size (1KB-1GB) for Bassi and Jaguar; annotations mark "DSL speed" at small transfers and "HPC speed" at large ones]
Scaling (No. of Processors)
• I/O performance peaks at:
  – P = 256 on Jaguar (lstripe = 144)
  – Close to its peak at P = 64 on Bassi
• Peak I/O performance can often be achieved at relatively low concurrency
I/O Scaling on Bassi
[Chart: write and read bandwidth (MB/s, 0-7000) vs. number of processors (8-256), with the system peak shown]
I/O Scaling on Jaguar
[Chart: write and read bandwidth (MB/s, 0-45,000) vs. number of processors (8-1024), with the system peak shown]
Shared vs. One file Per Proc
• Performance with a shared file is very close to that with one file per processor
• A shared file performs even better on Jaguar due to lower metadata overhead
Bassi
[Chart: individual-file vs. shared-file write and read bandwidth (MB/s, 0-7000) vs. number of processors (8-256)]
Jaguar
[Chart: individual-file vs. shared-file write and read bandwidth (MB/s, 0-45,000) vs. number of processors (8-1024)]
Programming Interface
• MPI-IO is close to POSIX performance
• Concurrent POSIX access to a single file works correctly
  – MPI-IO used to be required for correctness, but is no longer
• HDF5 (v1.6.5) falls a little behind, but tracks MPI-IO performance
• parallelNetCDF (v1.0.2pre) performs worst, and still has a 4GB dataset size limitation (due to limits on per-dimension sizes in the latest version)
Bassi
[Chart: POSIX, MPI-IO, HDF5, and NetCDF bandwidth (MB/s, 0-6000) vs. number of processors (0-300)]
Programming Interface
• POSIX, MPI-IO, HDF5 (v1.6.5) offer very similar scalable performance
• parallelNetCDF (v1.0.2.pre): flat performance
Jaguar
[Chart: POSIX, MPI-IO, HDF5, and NetCDF bandwidth (MB/s, 0-30,000) vs. number of processors (8-1024)]
Outline
• Why IOR ?
• Using IOR to study system performance
• Using IOR to predict I/O performance for application
Madbench
• Astrophysics application, used to analyze the massive Cosmic Microwave Background datasets
• Important parameters related to I/O:
  – Pixels: matrix size = pixels * pixels
  – Bins: number of matrices
• I/O behavior:
  – Out-of-core application
  – Matrix write/read
• Weak scaling problem
  – Pixels/Proc = 25K/16
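The matrix size above translates into I/O volume as follows; the 8-byte element size (double precision) is an assumption, since the slide gives only matrix size = pixels * pixels:

```python
# Sketch: bytes moved per MadBench matrix write/read, assuming
# double-precision (8-byte) elements; the slide specifies only that
# matrix size = pixels * pixels.
def matrix_bytes(pixels, elem_bytes=8):
    return pixels * pixels * elem_bytes

print(matrix_bytes(25_000) / 2**30)  # a 25K-pixel matrix is ≈ 4.66 GiB
```

At these sizes the matrices cannot stay resident, which is why the application is out-of-core.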
I/O Performance Prediction for Madbench
• IOR parameters: TransferSize=16MB, blockSize=64MB, segmentCount=1, P=64
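A quick arithmetic check of the data volumes these IOR settings imply:

```python
# Data volume implied by the IOR settings used to mimic MadBench:
# transferSize=16MB, blockSize=64MB, segmentCount=1, P=64.
MB = 1 << 20
transfer, block, segments, nprocs = 16 * MB, 64 * MB, 1, 64

per_proc = block * segments        # bytes each process moves
total = per_proc * nprocs          # shared-file size (or sum of per-proc files)
print(per_proc // MB, total // MB)  # → 64 4096  (64 MB/proc, 4 GB total)
```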
Madbench vs. IOR
[Chart: % prediction error (-100% to +100%) for read and write with individual and shared files, on Bassi and Jaguar; positive values indicate overprediction, negative values underprediction]
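The error metric plotted above can be read as the signed relative difference between the IOR prediction and the measured application rate; a hedged sketch of that reading:

```python
# Sketch of the prediction-error metric implied by the chart: signed
# relative difference between IOR's predicted rate and the application's
# measured rate (positive = overprediction, negative = underprediction).
def prediction_error(predicted_mbs, measured_mbs):
    return (predicted_mbs - measured_mbs) / measured_mbs * 100.0

print(prediction_error(1200.0, 1000.0))  # → 20.0  (IOR overpredicts by 20%)
print(prediction_error(800.0, 1000.0))   # → -20.0 (IOR underpredicts by 20%)
```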
Summary
• Surveyed the I/O requirements of NERSC applications and selected IOR as the synthetic benchmark for studying I/O performance
• I/O performance:
  – Highly affected by file size, I/O transaction size, and concurrency
  – Peaks at relatively low concurrency
  – The overhead of HDF5 and MPI-IO is low, but that of pNetCDF is high
• IOR can be used effectively to predict I/O performance for some applications
Extra Material
Chombo
• Chombo is a tool package for solving PDE problems on block-structured, adaptively refined regular grids
• I/O is used to read/write the hierarchical grid structure at the end of each time step
• Test problem: grid size = 400 * 400, 1 time step
Chombo I/O Behavior
• HDF5 interface
• Block size varies substantially, from 1KB - 10MB

[Diagram: boxes held out of order across processors P0-P2 in memory (e.g., 3 4 7 | 1 5 6 | 2 9 8) are written in sequential order (1-9) in the file]
Distribution of Box Data Size
[Chart: percentage of boxes (0-30%) vs. box data size (1KB to ~32MB)]
I/O Performance Prediction for Chombo