Parallel I/O
Sources/Credits:
R. Thakur, W. Gropp, E. Lusk. A Case for Using MPI's Derived Datatypes to Improve I/O Performance. Supercomputing '98.
http://www.cs.dartmouth.edu/pario/bib/short.html (bibliography)
Xiaosong Ma, Marianne Winslett, Jonghyun Lee, and Shengke Yu. Improving MPI-IO output performance with active buffering plus threads. In Proceedings of the International Parallel and Distributed Processing Symposium. IEEE Computer Society Press, April 2003.
Mahmut Kandemir. Compiler-directed collective I/O. IEEE Transactions on Parallel and Distributed Systems, 12(12):1318-1331, December 2001.
Meenakshi A. Kandaswamy, Mahmut Kandemir, Alok Choudhary, and David Bernholdt. An experimental evaluation of I/O optimizations on different applications. IEEE Transactions on Parallel and Distributed Systems, 13(7):728-744, July 2002.
High Performance with Derived Datatypes (Thakur et al.: SC '98)
• The potential of parallel file systems is often not fully utilized because of applications' I/O access patterns
• Applications issue many small requests to noncontiguous blocks of the file
• Most parallel file systems perform best on a single large contiguous access
• Hence the motivation for making a single I/O call using derived datatypes (an example follows the list of constructors below)
• ROMIO performs two optimizations on such requests: data sieving and collective I/O
ROMIO Architecture
Datatype Constructors in MPI
1. contiguous
2. vector / hvector
3. indexed / hindexed / indexed_block
4. struct
5. subarray
6. darray
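As a hypothetical illustration of how these constructors are used (the array size, column-block decomposition, and file name below are assumptions, not taken from the slides), the following C sketch describes each process's noncontiguous file region with a subarray datatype and writes it with a single collective call:

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical sketch: a 512 x 512 global array of ints, column-block
 * decomposed across the processes, so each process's file data is 512
 * noncontiguous strided pieces.  A subarray datatype describes the whole
 * region, and one collective call writes it (level-3 access). */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int gsizes[2] = {512, 512};                 /* global array (assumed)      */
    int lsizes[2] = {512, 512 / nprocs};        /* assumes nprocs divides 512  */
    int starts[2] = {0, rank * lsizes[1]};      /* this process's columns      */

    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    int n = lsizes[0] * lsizes[1];
    int *local = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) local[i] = rank;        /* dummy data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "datafile" /* hypothetical name */,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, local, n, MPI_INT, MPI_STATUS_IGNORE);  /* one call */
    MPI_File_close(&fh);

    MPI_Type_free(&filetype);
    free(local);
    MPI_Finalize();
    return 0;
}

Expressing the whole access in one call like this is what gives ROMIO the opportunity to apply the data sieving and collective I/O optimizations described next.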
Different levels of access
Optimizations in ROMIO for derived-datatype noncontiguous access
1. Data sieving
• Make a few large contiguous requests to the file system even if the user's request consists of many small noncontiguous pieces (sketched below)
• Extract (sieve out) in memory only the data that is actually needed
• This works directly for reads; what about writes?
• A smaller buffer is used for writing with data sieving than for reading. Why?
  - Writing requires a read-modify-write along with locking of the byte range
  - The greater the size of the write buffer, the greater the contention among processes for locks
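The following is a simplified sketch of the data-sieving idea for reads (an illustration only, not ROMIO's actual code; the request structure and the assumption that requests are sorted by offset are mine):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

/* One small noncontiguous request: <file offset, length in bytes>. */
struct request { off_t offset; size_t len; };

/* Simplified data sieving for reads: instead of issuing nreq small reads,
 * issue ONE large read covering the whole span (requests assumed sorted by
 * offset) and then copy out only the bytes that were actually requested. */
void sieve_read(int fd, const struct request *reqs, int nreq, char **out)
{
    off_t start = reqs[0].offset;
    off_t end   = reqs[nreq - 1].offset + (off_t)reqs[nreq - 1].len;
    size_t span = (size_t)(end - start);

    char *big = malloc(span);
    if (pread(fd, big, span, start) < 0) {   /* one large contiguous access */
        free(big);
        return;
    }
    for (int i = 0; i < nreq; i++)           /* sieve out the needed data   */
        memcpy(out[i], big + (reqs[i].offset - start), reqs[i].len);
    free(big);
}
/* A data-sieving write would lock the same span, read it, overwrite the
 * requested pieces in the buffer, and write the whole span back
 * (read-modify-write), hence the smaller write buffer to limit lock
 * contention. */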
Optimizations in ROMIO for derived-datatype noncontiguous access
1. Data sieving
2. Collective I/O
• During collective-I/O functions, the implementation can analyze and merge the requests of the different processes
• The merged request can be large and contiguous even though the individual requests were noncontiguous
• I/O is performed in two phases (a minimal sketch follows):
  - I/O phase: processes perform I/O for the merged request; some of the data may belong to other processes, and if the merged request is still not contiguous, data sieving is used
  - Communication phase: processes redistribute the data among themselves to obtain the desired distribution
• The additional cost of the communication phase can be offset by the performance gain from contiguous access
• Data sieving and collective I/O also help improve caching and prefetching in the underlying file system
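Below is a minimal sketch of the two-phase idea for one specific, assumed access pattern (an interleaved 1-D block pattern; the file name, block counts, and the requirement that the process count divide the per-process block count are assumptions, and ROMIO's real implementation handles arbitrary patterns, aggregator subsets, and sieving within the merged request):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical two-phase write: P processes each own NB blocks of BLK ints,
 * interleaved in the file (process r owns global blocks r, r+P, r+2P, ...).
 * Assumes P divides NB.
 * Phase 1 (communication): redistribute so process j holds the j-th
 * contiguous slab of the merged request.
 * Phase 2 (I/O): each process writes its slab with one contiguous access. */
#define NB  8      /* blocks per process (assumption) */
#define BLK 4      /* ints per block (assumption)     */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    int *local = malloc(NB * BLK * sizeof(int));  /* this rank's blocks, in order */
    for (int i = 0; i < NB * BLK; i++) local[i] = rank;

    /* Communication phase: for this interleaved pattern, the chunk of the
     * local buffer destined for process j (its blocks j*NB/P .. (j+1)*NB/P-1)
     * is already contiguous, so a plain MPI_Alltoall suffices. */
    int chunk = (NB / P) * BLK;
    int *recv = malloc(P * chunk * sizeof(int));
    MPI_Alltoall(local, chunk, MPI_INT, recv, chunk, MPI_INT, MPI_COMM_WORLD);

    /* Put the received pieces into file order within this slab: the m-th
     * block received from rank r belongs at slab offset (m*P + r)*BLK. */
    int *slab = malloc(NB * BLK * sizeof(int));
    for (int r = 0; r < P; r++)
        for (int m = 0; m < NB / P; m++)
            memcpy(&slab[(m * P + r) * BLK],
                   &recv[(r * (NB / P) + m) * BLK], BLK * sizeof(int));

    /* I/O phase: one large contiguous write per process. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "twophase.dat" /* hypothetical */,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * NB * BLK * sizeof(int);
    MPI_File_write_at(fh, off, slab, NB * BLK, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(local); free(recv); free(slab);
    MPI_Finalize();
    return 0;
}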
Collective I/O Illustration
Results
Table 1: Read performance for distributed-array access (array size 512 x 512 x 512 integers, file size 512 Mbytes). Bandwidth in Mbytes/s.

Machine          Processors   Level 0/1   Level 2   Level 3
SGI Origin2000   32           14.0        118       175
NEC SX4          8            0.71        322       563
Intel Paragon    256          3.01        9.50      132
IBM SP           64           2.13        11.9      90.2
HP Exemplar      64           5.42        14.2      68.2
If the requests of the processes calling a collective function are not interleaved in the file, ROMIO's collective implementation simply calls the corresponding independent-I/O function on each process; hence Level 1 performance equals Level 0.
The improvement at Level 2 is due to data sieving; the improvement at Level 3 is due to collective I/O.
Results
Table 3: Write performance for distributed-array access (array size 512 x 512 x 512 integers, file size 512 Mbytes). Bandwidth in Mbytes/s.

Machine          Processors   Level 0/1   Level 2   Level 3
SGI Origin2000   32           5.06        13.1      66.7
NEC SX4          8            0.62        75.3      447
Intel Paragon    256          1.12        3.33      183
IBM SP           64           1.85        1.85      57.6
HP Exemplar      64           0.54        1.25      50.7

Note: the IBM SP's PIOFS does not support file locking, so data sieving cannot be used for writes there (Level 2 equals Level 0/1).
Active Buffering with Threads (Xiaosong Ma et al.: IPDPS 2003)
• The above optimizations alone are not enough
• Active buffering: use of separate I/O nodes
• I/O access is overlapped with computation using threads
• Buffer space is automatically adjusted to the available memory
Original Scheme (Ma: IPDPS 2002)
• Hierarchical buffering scheme with dedicated I/O server nodes
• During I/O:
    if (no overflow in compute-node buffers)
        compute nodes -> local buffers
    else if (no overflow in server-node buffers)
        compute nodes -> server buffers (using MPI)
    else
        server nodes -> I/O system
• During computation:
    server nodes clear their local buffers with I/O writes, then fetch data
    from compute nodes (one-sided communication) and write it out
Current Scheme
• I/O threads perform collective I/O, overlapped with the main threads' computation and communication
• Uses pthreads with kernel-level scheduling
• Intercepts ROMIO's I/O calls
• Main threads and I/O threads coordinate through a buffer queue: a producer-consumer / bounded-buffer problem, as sketched below
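Below is a minimal pthreads sketch of such a bounded buffer queue (an illustration under assumptions, not the ABT code: the queue depth, the buf structure, and the write_to_file stub are hypothetical). A zero-length sentinel entry plays the role of the specially tagged buffer used for termination, described under "Other issues" below:

#include <pthread.h>
#include <stdlib.h>

#define QDEPTH 4                         /* bounded queue depth (assumption) */

struct buf { void *data; size_t len; };  /* len == 0 acts as the sentinel */

static struct buf queue[QDEPTH];
static int head = 0, tail = 0, count = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

/* Producer side (main thread): block while the queue is full. */
void enqueue(struct buf b)
{
    pthread_mutex_lock(&lock);
    while (count == QDEPTH)
        pthread_cond_wait(&not_full, &lock);
    queue[tail] = b;
    tail = (tail + 1) % QDEPTH;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

/* Consumer side (background I/O thread): block while the queue is empty. */
static struct buf dequeue(void)
{
    pthread_mutex_lock(&lock);
    while (count == 0)
        pthread_cond_wait(&not_empty, &lock);
    struct buf b = queue[head];
    head = (head + 1) % QDEPTH;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return b;
}

static void write_to_file(struct buf b) { (void)b; /* hypothetical I/O call */ }

/* Background I/O thread: drain buffers and write them out, overlapping with
 * the main thread's computation; the zero-length sentinel makes it terminate
 * (the role played by the specially tagged buffer appended at file close). */
void *io_thread(void *arg)
{
    (void)arg;
    for (;;) {
        struct buf b = dequeue();
        if (b.len == 0)
            break;
        write_to_file(b);
        free(b.data);
    }
    return NULL;
}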
Execution Timeline
Other issues
• The background thread is initiated during the first collective I/O call
• Interesting termination: during MPI_FILE_CLOSE, a special buffer with a special tag is appended to the buffer queue; on seeing it, the background thread terminates
Compiler-Directed Collective I/O (Kandemir: 2001)
• Under what circumstances is collective I/O useful? Should we use level-3 access all the time?
• Compiler analysis of data access patterns and storage (file layout) patterns
• Selective insertion of MPI collective-I/O or independent-I/O calls
• Example: conforming and non-conforming access patterns
Bibliography
Philip H. Carns, Walter B. Ligon III, Robert B. Ross, and Rajeev Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, pages 317-327, Atlanta, GA, October 2000. USENIX Association.
Jose Aguilar. A graph theoretical model for scheduling simultaneous I/O operations on parallel and distributed environments. Parallel Processing Letters, 12(1):113-126, March 2002.
Rajesh Bordawekar. Implementation of collective I/O in the Intel Paragon parallel file system: Initial experiences. In Proceedings of the 11th ACM International Conference on Supercomputing, pages 20-27. ACM Press, July 1997.
Peter Brezany, Marianne Winslett, Denis A. Nicole, and Toni Cortes. Parallel I/O and storage technology. In Proceedings of the Seventh International Euro-Par Conference, volume 2150 of Lecture Notes in Computer Science, pages 887-888, Manchester, UK, August 2001. Springer-Verlag.
Bradley Broom, Rob Fowler, and Ken Kennedy. KelpIO: A telescope-ready domain-specific I/O library for irregular block-structured applications. In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 148-155, Brisbane, Australia, May 2001. IEEE Computer Society Press.
J. Carretero, F. Pérez, P. de Miguel, F. García, and L. Alonso. I/O data mapping in ParFiSys: Support for high-performance I/O in parallel and distributed systems. In Euro-Par '96, volume 1123 of Lecture Notes in Computer Science, pages 522-526. Springer-Verlag, August 1996.
Ying Chen, Marianne Winslett, Y. Cho, and S. Kuo. Automatic parallel I/O performance optimization using genetic algorithms. In Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, pages 155-162. IEEE Computer Society Press, July 1998.
Ying Chen, Ian Foster, Jarek Nieplocha, and Marianne Winslett. Optimizing collective I/O performance on parallel computers: A multisystem study. In Proceedings of the 11th ACM International Conference on Supercomputing, pages 28-35. ACM Press, July 1997.
Avery Ching, Alok Choudhary, Kenin Coloma, Wei-keng Liao, Robert Ross, and William Gropp. Noncontiguous I/O accesses through MPI-IO. In Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 104-111, Tokyo, Japan, May 2003. IEEE Computer Society Press.
Phillip M. Dickens and Rajeev Thakur. Evaluation of collective I/O implementations on parallel architectures. Journal of Parallel and Distributed Computing, 61(8):1052-1076, August 2001.
Félix García-Carballeira, Alejandro Calderón, Jesús Carretero, Javier Fernández, and José M. Pérez. The design of the Expand parallel file system. The International Journal of High Performance Computing Applications, 17(1):21-38, 2003.
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 96-108, Bolton Landing, NY, October 2003. ACM Press.
James V. Huber, Jr., Christopher L. Elford, Daniel A. Reed, Andrew A. Chien, and David S. Blumenthal. PPFS: A high performance portable parallel file system. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 22, pages 330-343. IEEE Computer Society Press and Wiley, New York, NY, 2001.
Meenakshi A. Kandaswamy, Mahmut Kandemir, Alok Choudhary, and David Bernholdt. An experimental evaluation of I/O optimizations on different applications. IEEE Transactions on Parallel and Distributed Systems, 13(7):728-744, July 2002.
Mahmut Kandemir. Compiler-directed collective I/O. IEEE Transactions on Parallel and Distributed Systems, 12(12):1318-1331, December 2001.
Xiaosong Ma, Marianne Winslett, Jonghyun Lee, and Shengke Yu. Improving MPI-IO output performance with active buffering plus threads. In Proceedings of the International Parallel and Distributed Processing Symposium. IEEE Computer Society Press, April 2003.
Tara M. Madhyastha and Daniel A. Reed. Learning to classify parallel input/output access patterns. IEEE Transactions on Parallel and Distributed Systems, 13(8):802-813, August 2002.
Ethan L. Miller and Randy H. Katz. RAMA: An easy-to-use, high-performance parallel file system. Parallel Computing, 23(4-5):419-446, June 1997.
Bill Nitzberg and Virginia Lo. Collective buffering: Improving parallel I/O performance. In Proceedings of the Sixth IEEE International Symposium on High Performance Distributed Computing, pages 148-157, Portland, OR, August 1997. IEEE Computer Society Press. (See also the later version, nitzberg:bcollective.)
Huseyin Simitci and Daniel Reed. A comparison of logical and physical parallel I/O patterns. The International Journal of High Performance Computing Applications, 12(3):364-380, Fall 1998.
Domenico Talia and Pradip K. Srimani. Parallel data-intensive algorithms and applications. Parallel Computing, 28(5):669-671, May 2002.
Len Wisniewski, Brad Smisloff, and Nils Nieuwejaar. Sun MPI I/O: Efficient I/O for parallel applications. In Proceedings of SC99: High Performance Networking and Computing, Portland, OR, November 1999. ACM Press and IEEE Computer Society Press.
K. K. Lee, M. Kallahalla, B. S. Lee, and P. J. Varman. Performance comparison of prefetching and placement policies for parallel I/O. International Journal of Parallel and Distributed Systems and Networks, 5(2):76-84, 2002.
M. Kallahalla and P. J. Varman. PC-OPT: Optimal offline prefetching and caching for parallel I/O systems. IEEE Transactions on Computers, 51(11):1333-1344, November 2002.
SCF 1.1 – Efficient Interface and Prefetching
SCF 3.0 – effect of balanced I/O
FFT – effect of layout optimization
BTIO – effect of collective I/O
AST – effect of collective I/O
Collective Buffering (Nitzberg et al.: HPDC '97)
• Addresses the mapping problem between the memory layout and the physical (file) layout
• There is a mismatch between memory and file in the data distribution, the individual units of data access, and the order of accesses
• Canonical file: the canonical file, i.e. the sequence of file bytes, is usually distributed in a cyclic manner across a parallel file system's I/O nodes
• Collective buffering techniques trade extra network traffic for reduced disk latencies
• Collective buffering performance depends on:
  - the intermediate data distribution
  - the efficiency of the permutations
  - the number of nodes used for the permutation
  - the buffer sizes used on those nodes
• ABT is implemented with ROMIO; its I/O cost is lower than that of plain ROMIO
• During reads, the write buffers are checked and written to disk first
Results
Compiler analysis
• How do the program components access the data?
• If the components access the data in its storage pattern, store it in that fashion and use independent parallel I/O
• If not, find the majority access pattern and use it as the storage pattern; for the components that do not adhere to this access pattern, use collective I/O, and for the others, use independent parallel I/O (see the sketch below)
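A hypothetical sketch of this decision rule follows (the integer pattern encoding and the plain vote are my assumptions; the paper's analysis weights components by the number of references made through each access pattern):

/* Pick the most common access pattern among the components. */
static int majority_pattern(const int *pattern, int ncomp, int npatterns)
{
    int best = 0, best_votes = -1;
    for (int p = 0; p < npatterns; p++) {
        int votes = 0;
        for (int c = 0; c < ncomp; c++)
            if (pattern[c] == p) votes++;
        if (votes > best_votes) { best_votes = votes; best = p; }
    }
    return best;
}

enum io_choice { USE_INDEPENDENT_IO, USE_COLLECTIVE_IO };

/* Use the majority pattern as the storage pattern; conforming components get
 * independent parallel I/O, non-conforming ones get collective I/O. */
void choose_io(const int *pattern, int ncomp, int npatterns,
               int *storage_pattern, enum io_choice *decision)
{
    *storage_pattern = majority_pattern(pattern, ncomp, npatterns);
    for (int c = 0; c < ncomp; c++)
        decision[c] = (pattern[c] == *storage_pattern) ? USE_INDEPENDENT_IO
                                                       : USE_COLLECTIVE_IO;
}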
Compiler analysis: Weighted Communication Graph (WCG)
• A node represents a code block (within a block, the data is kept in memory)
• Between code blocks, data is stored to disk
• There is an edge between node 1 and node 2 iff a data set produced in node 1 is consumed in node 2
• The weight on the edge represents the number of transitions
• Producers and consumers are also defined for each data set
• The access patterns found in the consumers determine the storage pattern chosen in the producer
Strategy
1. Determine access patterns
2. Determine storage patterns
3. Decide on the I/O strategy
4. Rewrite the code using the appropriate MPI-IO calls (sketched below)
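As a hypothetical illustration of step 4 (the names, sizes, and the cyclic access pattern below are assumptions, not taken from the paper), the compiler might emit an independent contiguous write for a component whose access pattern conforms to the storage pattern, and a collective write through a derived-datatype file view for a non-conforming one:

#include <mpi.h>

#define N 1024          /* elements written per process (assumption) */

/* Conforming component: the access pattern matches the storage pattern, so
 * the inserted call is an independent write of a contiguous piece. */
void write_conforming(MPI_File fh, int *data, int rank)
{
    MPI_Offset off = (MPI_Offset)rank * N * sizeof(int);
    MPI_File_write_at(fh, off, data, N, MPI_INT, MPI_STATUS_IGNORE);
}

/* Non-conforming component: here each process's data is assumed to land
 * cyclically in the file, so the inserted call is a collective write through
 * a derived-datatype file view describing the noncontiguous region. */
void write_nonconforming(MPI_File fh, int *data, int rank, int nprocs)
{
    MPI_Datatype filetype;
    MPI_Type_vector(N, 1, nprocs, MPI_INT, &filetype);   /* cyclic pattern */
    MPI_Type_commit(&filetype);
    MPI_File_set_view(fh, (MPI_Offset)rank * sizeof(int), MPI_INT,
                      filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, data, N, MPI_INT, MPI_STATUS_IGNORE);
    MPI_Type_free(&filetype);
}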
Steps
• Access pattern detection: loop-index analysis; the representative access pattern is chosen by the number of references made through a particular access pattern
• Storage layout detection: uses Producer-Consumer Subgraphs (PCSs) of the WCG
• I/O call insertion
Results
• Version 1 – independent parallel I/O for all components
• Version 2 – collective I/O for all components
• Version 3 – selective collective I/O
Experimental Evaluation of I/O Optimizations (Kandaswamy et al.: 2002)
• Five different I/O applications are considered
• Five software optimizations are studied: collective I/O, prefetching, file layout, efficient I/O interface, and balanced I/O
• Experiments were carried out with different numbers of I/O nodes on the Intel Paragon and IBM SP2
Summary
Guidelines