Parallel I/O
International HPC Summer School
July 11, 2018
Elsa Gonsiorowski, HPC I/O Specialist, LLNL
LLNL-PRES-751922
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Outline
Motivation
I/O in Parallel
  Step 1: Recognize a need
  Step 2: Existing I/O Libraries and Tools
  Step 3: I/O Patterns
  Step 4: Understand the File System
  Step 6: Profit
Technical Details: MPI I/O
Pro-Tips!
Motivation
Types of I/O
Input
  Launching an executable and its linked libraries
  Reading configuration files
  Loading data files
Output
  Checkpoints
  Results
Science
  Moving files from one machine to another
  Cleaning up after experiments
Everyone interacts with a file system; therefore, everyone does I/O!
Why should I care?
Data movement is expensive and must be optimized
Total execution time = Computation time + Communication time + I/O time
HPC Storage Stack
GPU memory (HBM2): 900 GB/s per GPU
CPU memory (DDR4): 120 GB/s per socket
Node-local storage or /tmp (SSD): 1.1 GB/s per node
PFS (HDD + SSD + Magic): 40 GB/s shared by a system
burst buffer, "project" storage, "campaign store"
HPSS (Tape + Robots): 0.2 GB/s shared by a center
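To get a feel for what these bandwidths mean, the sketch below estimates how long it takes to move a given amount of data at each tier and adds that I/O term to the execution-time sum from the motivation slide. It is only a back-of-the-envelope model: it assumes the quoted figures are sustained peak rates and ignores latency and contention. The macro and function names are invented for illustration.

```c
#include <assert.h>

/* Peak bandwidths quoted in the storage stack, in GB/s. */
#define BW_GPU_HBM2 900.0 /* per GPU */
#define BW_CPU_DDR4 120.0 /* per socket */
#define BW_NODE_SSD   1.1 /* per node */
#define BW_PFS       40.0 /* shared by the whole system */
#define BW_HPSS       0.2 /* shared by the whole center */

/* Idealized time in seconds to move `gigabytes` at `gb_per_s`.
 * Assumes the full peak rate is sustained; real transfers are slower. */
double transfer_seconds(double gigabytes, double gb_per_s)
{
    assert(gigabytes >= 0.0 && gb_per_s > 0.0);
    return gigabytes / gb_per_s;
}

/* Total execution time = computation + communication + I/O. */
double total_execution_seconds(double compute_s, double comm_s, double io_s)
{
    return compute_s + comm_s + io_s;
}
```

For example, transfer_seconds(400.0, BW_PFS) gives 10 seconds for a 400 GB checkpoint if one job had the whole PFS to itself, while the same data at BW_HPSS would take roughly 2000 seconds.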
File Systems
Laptop
  1 user
  1.1 GB/s
Network File System (NFS)
  m servers, n clients
  Home directory
  2 GB/s throughput
  280K IOPS
Parallel File System (PFS)
  Used by HPC jobs
  System specific
  Scratch or project storage
  40 GB/s throughput
  Millions of IOPS
Parallel File System
I/O in Parallel
Steps for Dealing with I/O
1. Recognize the need
   Get some data out of the application
   Get some data out of the application faster
   Deal with files efficiently
2. Investigate I/O libraries and tools; one may be common in your field.
3. Implement an I/O pattern
4. Understand the file system you are working on
5. ???
6. Profit!
Step 1: Recognize a need
Profiling
Darshan
TAU
Attend tomorrow’s performance analysis session!
Step 2: Existing Libraries + Tools
Parallel I/O Libraries and Tools
Reading & Writing Files:
  HDF5
  PnetCDF
  Others: ADIOS, TyphonIO, SILO
  MPI-IO
Managing Files:
  Spindle
  mpiFileUtils
  SCR
Library: HDF5
Hierarchical Data Format
File system in a file
Datasets: multidimensional arrays of a homogeneous type
Groups: container structures which can hold datasets and other groups
Official support for C, C++, Fortran 77, Fortran 90, Java
Implementations in R, Perl, Python, Ruby, Haskell, Mathematica, MATLAB, etc.
Library: PnetCDF
Built on netCDF and MPI-IO
netCDF:
  Self-describing, machine-independent format
  Designed for arrays of scientific data
  netCDF is implemented in C, C++, Fortran 77, Fortran 90, Java, R, Perl, Python, Ruby, Haskell, Mathematica, MATLAB, etc.
Library: MPI-IO
API for interacting with files with MPI concepts
  blocking vs. non-blocking
  collective vs. non-collective
Lower level than other libraries
Fine-grain control of files and offsets
C and Fortran interfaces
Separate effort from regular MPI
Tool: Spindle
Scalable dynamic library and Python loading
Caches linked libraries
Life saver for NFS issues
https://github.com/hpc/spindle
Tool: mpiFileUtils
Use parallel processes to perform file operations
Executed within a job allocation
dbcast: broadcast a file from PFS to node-local storage
dcp: copy multiple files in parallel
drm: delete files in parallel
many more
https://github.com/hpc/mpifileutils
Library: SCR
Scalable Checkpoint Restart
Enable checkpointing applications to take advantage of system storage hierarchies
Efficient file movement between storage layers
Data redundancy operations
Step 3: I/O Patterns
Parallel I/O Patterns
Single file, accessed by 1 task
Single shared file, accessed by all tasks
Many shared files, accessed by groups of tasks
  Baton-passing
  Coordinated "View"
Many independent files, accessed by a subset of tasks
One file per process
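Most of these patterns differ only in how many tasks share a file, so one way to experiment is to make that group size a parameter. The sketch below is an illustration, not part of any library: the function names and the file-naming scheme are made up, and it assumes ranks are grouped consecutively.

```c
#include <assert.h>
#include <stdio.h>

/* Index of the file that `rank` uses when every `ranks_per_file`
 * consecutive ranks share one file. */
int file_index_for_rank(int rank, int ranks_per_file)
{
    assert(rank >= 0 && ranks_per_file > 0);
    return rank / ranks_per_file;
}

/* Position of `rank` inside its group (0 = first task of the group,
 * e.g. the one that creates the file or starts a baton pass). */
int rank_within_file(int rank, int ranks_per_file)
{
    assert(rank >= 0 && ranks_per_file > 0);
    return rank % ranks_per_file;
}

/* Write a name such as "out.0003.dat" into buf (hypothetical scheme).
 * Returns the name length, as reported by snprintf. */
int format_file_name(char *buf, size_t len, const char *prefix, int file_index)
{
    return snprintf(buf, len, "%s.%04d.dat", prefix, file_index);
}
```

With ranks_per_file set to 1 this degenerates to one file per process; set to the total number of ranks it gives a single shared file; set to the number of ranks per node it gives one file per node, provided the launcher places consecutive ranks on the same node.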
Step 4: Understand the PFS
Parallel File System Policies
Allocation: how much space you have
Backups: whether backups or snapshots are created
Purges: when data is deleted
Configuration: the I/O pattern the system is configured for
Parallel File Systems
Black Magic: IBM’s GPFS (General Parallel File System)
  Closed source
  aka Elastic Scale Storage™ or Spectrum Scale™
  HPC users do not have knobs to tune
White Magic: Lustre
  Open source
  Users can deviate from default behavior
Lustre Striping
HDDs are logically grouped into OSTs (Object Storage Targets)
Users can stripe a file across multiple OSTs
  Explicitly take advantage of multiple OSTs
  Depends on the total amount of I/O you are doing
  There is a system default
Use the correct striping for your use case
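To make striping concrete, the sketch below computes which stripe of a file a given byte offset falls in. It assumes the plain round-robin (RAID-0 style) layout, in which consecutive stripe_size chunks go to consecutive OST objects; other layouts would need different arithmetic. The function names are hypothetical and this is not Lustre code.

```c
#include <assert.h>
#include <stdint.h>

/* Which of the file's `stripe_count` objects (0-based) holds byte `offset`,
 * assuming chunks of `stripe_size` bytes are dealt out round-robin. */
uint64_t stripe_index(uint64_t offset, uint64_t stripe_size, uint64_t stripe_count)
{
    assert(stripe_size > 0 && stripe_count > 0);
    return (offset / stripe_size) % stripe_count;
}

/* Byte position of `offset` inside that object: one chunk per completed
 * round over all objects, plus the position inside the current chunk. */
uint64_t object_offset(uint64_t offset, uint64_t stripe_size, uint64_t stripe_count)
{
    assert(stripe_size > 0 && stripe_count > 0);
    uint64_t full_rounds = offset / (stripe_size * stripe_count);
    return full_rounds * stripe_size + offset % stripe_size;
}
```

With a 4 MiB stripe size and a stripe count of 4, a task writing at offset 20 MiB lands on object 1, so tasks whose writes are aligned to the stripe size and spread over different stripes can hit different OSTs at the same time.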
Lustre Striping Commands
$ lfs setstripe -c 4 -s 4M testfile2
$ lfs getstripe ./testfile2
./testfile2
lmm_stripe_count: 4
lmm_stripe_size: 4194304
lmm_stripe_offset: 21
    obdidx     objid      objid   group
        21   8891547   0x87ac9b       0
        13   8946053   0x888185       0
        57   8906813   0x87e83d       0
        44   8945736   0x888048       0
(Newer Lustre releases spell the stripe-size option -S; -s is the older form.)
Lustre Striping Commands
$ lfs getstripe ./testfile
./testfile
lmm_stripe_count: 2
lmm_stripe_size: 1048576
lmm_stripe_offset: 50
    obdidx     objid      objid   group
        50   8916056   0x880c58       0
        38   8952827   0x889bfb       0
Step 6: Profit
Steps for Dealing with I/O
1. Recognize the need
   Get some data out of the application
   Get some data out of the application faster
   Deal with files efficiently
2. Investigate I/O libraries and tools; one may be common in your field.
3. Implement an I/O pattern
4. Understand the file system you are working on
5. ???
6. Profit!
Technical Details: MPI I/O
Locking and Atomicity
$ export BGLOCKLESSMPIO_F_TYPE=1
int MPI_File_set_atomicity ( MPI_File mpi_fh, int flag );
Opening Files
int MPI_File_open(MPI_Comm comm, const char *filename, int amode, MPI_Info info, MPI_File *fh);
AMode                       Description
MPI_MODE_RDONLY             read only
MPI_MODE_RDWR               reading and writing
MPI_MODE_WRONLY             write only
MPI_MODE_CREATE             create the file if it does not exist
MPI_MODE_EXCL               error if file already exists
MPI_MODE_DELETE_ON_CLOSE    delete file on close
MPI_MODE_UNIQUE_OPEN        file will not be concurrently opened
MPI_MODE_SEQUENTIAL         file will only be accessed sequentially
MPI_MODE_APPEND             set initial position of all file pointers to end of file
Organizing Data
Use MPI_Datatype to define the structure of your data
  Corresponds to C struct
  Read and write instances of this data
Use MPI_File_set_view for working with non-contiguous data in a shared file
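One common non-contiguous layout in a shared file is interleaving, where tasks take turns owning fixed-size blocks. The sketch below only computes the byte offsets such a layout implies, in plain C with no MPI calls; in an MPI-IO program the same pattern would typically be described by a strided derived datatype passed to MPI_File_set_view. The function name and the fixed block size are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset in the shared file of the k-th block owned by `rank`, when
 * `nranks` tasks interleave blocks of `block_bytes` bytes:
 *   rank0 blk0 | rank1 blk0 | ... | rank(n-1) blk0 | rank0 blk1 | ... */
uint64_t interleaved_offset(uint64_t rank, uint64_t nranks,
                            uint64_t k, uint64_t block_bytes)
{
    assert(nranks > 0 && rank < nranks && block_bytes > 0);
    return (k * nranks + rank) * block_bytes;
}
```

With 4 tasks and 1024-byte blocks, rank 1 owns the blocks starting at offsets 1024, 5120, 9216 and so on; a file view lets each task treat its own blocks as if they were contiguous.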
Useful MPI Function
offset = (long long) 0;
MPI_Exscan(&contribute, &offset, 1, MPI_LONG_LONG, MPI_SUM, file_comm);
Rank         0   1   2   3   4
contribute   3   4   2   7   3
offset       0   3   7   9  16
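The offsets in this table are an exclusive prefix sum: each rank's offset is the total contributed by all lower ranks, which is exactly where it should start writing in a shared file. The sketch below reproduces that arithmetic serially, without MPI, as a stand-in for what MPI_Exscan with MPI_SUM computes across ranks. The function name is made up. In real MPI code the receive buffer on rank 0 is left undefined by MPI_Exscan, which is why the slide sets offset to 0 before the call.

```c
#include <assert.h>
#include <stddef.h>

/* Serial stand-in for MPI_Exscan with MPI_SUM: offset[i] receives the sum of
 * contribute[0] .. contribute[i-1], so offset[0] is always 0. */
void exclusive_scan(const long long *contribute, long long *offset, size_t n)
{
    assert(n == 0 || (contribute != NULL && offset != NULL));
    long long running = 0;
    for (size_t i = 0; i < n; i++) {
        offset[i] = running;      /* total of all earlier entries */
        running += contribute[i]; /* include this entry for the next one */
    }
}
```

Feeding it the contribute row from the table (3, 4, 2, 7, 3) yields the offset row (0, 3, 7, 9, 16).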
Accessing Files with MPI
Level 0: independent file ops, explicit offset, sequential data
Level 1: collective file ops, explicit offset, sequential data
Level 2: independent file ops, derived or non-contiguous data
Level 3: collective file ops, derived or non-contiguous data
MPI I/O & Lustre
Can be built by HPC resource providers with Lustre integration
Hints are passed through an MPI_Info object; in C both keys and values are strings (e.g. "4")
MPI_Info_set(myinfo, "striping_factor", stripe_count);
MPI_Info_set(myinfo, "striping_unit", stripe_size);
MPI_Info_set(myinfo, "cb_nodes", num_writers);
Pro-Tips!
Pro-Tip!
Step One
Profile your code. Fix up the I/O until it doesn’t suck.
Pro-Tip!
Be Smart
Don’t re-invent I/O; use an existing library or tool.
Pro-Tip!
Working with File Systems
Use the PFS for parallel I/O; do NOT use NFS.
Pro-Tip!
I/O Pattern
Create 1 file per node and make this a tunable parameter.
Pro-Tip!
Ask an Expert
Find the "I/O person" at your HPC center and ask for guidance.
This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.