
2012 International Conference on High Performance Computing & Simulation (HPCS), Madrid, Spain, 2-6 July 2012

Scalable High Performance Computing in Wide Area Network

Rashid Hassani
Institute of Computer Science, University of Rostock, Rostock, Germany
[email protected]

Dissertation Advisor: Prof. Dr. Peter Luksch

DOCTORAL DISSERTATION COLLOQUIUM EXTENDED ABSTRACT

Abstract— Many parallel applications in High Performance Computing have to communicate via a wide area network (WAN), e.g., in a grid or cloud environment that spans multiple sites. Communication across WAN links slows down the application due to high latency and low bandwidth. Much of this overhead is due to the current implementations of the MPI (Message Passing Interface) standard. My project aims at improving the WAN performance of MPI. Virtually all of today's wide area MPI implementations rely on the TCP/IP protocol. I propose to replace it with an innovative concurrent multipath communication method (CMC-SCTP) and to integrate it into the Open MPI project, which will increase bandwidth and enhance fault resilience within the MPI protocol stack in WAN environments. I plan to make my research results available to the community within the scope of the Open MPI project.

Keywords: HPC; Open MPI; MPI; SCTP; CMC-SCTP

I. INTRODUCTION

Today, High Performance Computing (HPC) means parallel computing. Massively Parallel Processors (MPPs) have large numbers of processors that are tightly coupled via a high-bandwidth, low-latency network. The majority of HPC systems nowadays are clusters, i.e., multiprocessor systems that use off-the-shelf components. They appear to the user as a single system. Computational Grids use multiple, distributed HPC systems to solve large-scale problems. Such a collection of HPC resources forms a virtual supercomputer, which typically is heterogeneous with respect to the architectures of the individual systems, the processors, and the interconnection network. Tightly coupled clusters or MPPs are connected over a wide area network (WAN). Efficient operation of such a distributed system requires that the characteristics of communication and synchronization across WAN connections be addressed appropriately.

The most general programming paradigm for such heterogeneous systems is message passing. The Message Passing Interface (MPI) standard provides the basis for portable programming of a large variety of parallel computer architectures, ranging from symmetric multiprocessors to MPPs, clusters, and computational grids [1][2]. TCP was available in the first public-domain versions of MPI, such as LAM/MPI [3] and MPICH [4], and more recently the use of MPI over TCP has been extended to wide area networks, the Internet, and computational grid environments that link together diverse, geographically distributed computing resources. Most MPI implementations for distributed computing environments in wide area networks rely on TCP, e.g., MPICH-G2 [5], PACX-MPI [6], FT-MPI [7], and LAM/MPI. The main advantage of using an IP-based protocol (i.e., TCP/UDP) for MPI is portability and the ease with which it can be used to execute MPI programs in diverse network environments.

One well-known problem with using TCP in wide-area environments is the presence of large latencies and the difficulty of utilizing all of the available bandwidth [8]. The processes of an MPI application tend to be loosely synchronized, even when they are not directly communicating with each other. Delays in one process can have a ripple effect on other processes, which are delayed as well. As a result, performance can be poor, especially in the presence of network congestion and the resulting increased latency, which can cause processes to stall while waiting for the delivery or receipt of a message. Another significant problem of MPI over TCP in wide area networks is scalability: MPI processes typically establish a large number of TCP connections to communicate with the other processes in the application. This requires the operating system to maintain a large number of socket descriptors, which affects application performance.
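The scalability concern above can be made concrete with a back-of-the-envelope sketch; the counts model a fully connected set of processes, which is the worst case rather than every application's communication pattern:

```python
def total_tcp_connections(n_procs: int) -> int:
    """One TCP connection per communicating pair: n*(n-1)/2 in the
    fully connected worst case."""
    return n_procs * (n_procs - 1) // 2

def descriptors_per_process(n_procs: int) -> int:
    """Each process holds one socket descriptor per peer."""
    return n_procs - 1

# Growth is quadratic in the number of processes:
assert total_tcp_connections(16) == 120
assert total_tcp_connections(1024) == 523776
assert descriptors_per_process(1024) == 1023
```

At 1024 processes, each operating system instance already tracks over a thousand descriptors per process, which illustrates why connection count becomes a first-order concern at scale.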

978-1-4673-2362-8/12/$31.00 ©2012 IEEE


A. Open MPI

Open MPI [9] provides a very good platform for projects that contribute new solutions to particular problems in message passing. The modular structure of the open-source software allows us to integrate new solutions with limited implementation effort, so new approaches can be evaluated quickly in a real-world environment. As shown in Figure 1, the two main component frameworks in the point-to-point communication system of Open MPI are the PTL (Point-to-point Transport Layer) and the PML (Point-to-point Management Layer). A recent Open MPI project called TEG [10] is an evolution of the LA-MPI point-to-point system. TEG provides a fault-tolerant point-to-point communication module in Open MPI that maximizes bandwidth utilization on the connections, but it suffers from ping-pong latency in heterogeneous network and operating system environments.

Figure 1. The CMC-integrated layered architecture of Open MPI
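The division of labor between the two layers can be illustrated with a toy model: the management layer selects a transport per destination, and the transport layer moves the bytes. The class and module names below are hypothetical stand-ins for illustration, not the actual Open MPI component API:

```python
# Toy model of Open MPI's point-to-point layering: the PML (management
# layer) dispatches each send to a PTL (transport layer) chosen per peer.
class Ptl:
    """Stand-in for a transport module such as a TCP or SCTP PTL."""
    def __init__(self, name):
        self.name = name
        self.sent = []
    def send(self, dest, payload):
        self.sent.append((dest, payload))
        return self.name

class Pml:
    """Stand-in for the management layer: routes sends to transports."""
    def __init__(self):
        self.routes = {}  # destination rank -> PTL module
    def add_route(self, dest, ptl):
        self.routes[dest] = ptl
    def send(self, dest, payload):
        return self.routes[dest].send(dest, payload)

pml = Pml()
pml.add_route(1, Ptl("tcp"))   # WAN peer served by TCP today...
pml.add_route(2, Ptl("sctp"))  # ...to be served by the new SCTP module
assert pml.send(1, b"hello") == "tcp"
assert pml.send(2, b"hello") == "sctp"
```

The point of this structure is that a new transport can be added without touching the management logic, which is what makes the SCTP integration planned below tractable.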

B. Replacing TCP by SCTP in the MPI protocol stack

Currently, most wide area implementations of MPI rely on TCP. However, we believe that TCP is not well suited to MPI applications, especially in wide area networks, because TCP is not message-oriented, enforces strict in-order delivery, is vulnerable to certain denial-of-service attacks, and, most importantly, has no built-in support for multi-homed hosts (hosts with multiple IP addresses). SCTP [11] is message-oriented like UDP but has TCP-like connection management, congestion control, and flow control mechanisms. SCTP can define streams, which allow multiple independent and reliable message sub-flows inside a single association. It introduces several new features, such as multi-homing and multi-streaming, that make it interesting for use in clusters. We believe most of these TCP problems can be solved if TCP is replaced by SCTP in the protocol stack.

SCTP associations and their stream feature closely match the message-ordering semantics of MPI when messages with the same tag, source rank, and context are used to define a stream (tag) within an association (source rank). This makes the integration of the Open MPI middleware with the SCTP protocol easier.
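A minimal sketch of this envelope-to-stream mapping follows; the stream count and the hash that folds the (context, tag) pair onto a stream index are illustrative assumptions, not the scheme of any existing implementation:

```python
NUM_STREAMS = 8  # assumed per-association stream budget (illustrative)

def map_message(source_rank: int, tag: int, context_id: int):
    """Map an MPI message envelope onto SCTP naming: one association
    per source rank, one stream per (context, tag) combination."""
    association = source_rank
    stream = (context_id * 31 + tag) % NUM_STREAMS  # assumed hash
    return association, stream

# Messages with the same (source, tag, context) land on the same stream,
# so SCTP's per-stream ordering gives exactly MPI's ordering guarantee;
# messages with different tags may proceed independently.
assert map_message(3, tag=7, context_id=0) == map_message(3, tag=7, context_id=0)
```

The design payoff is that a slow message on one stream (tag) no longer head-of-line blocks unrelated tags, which a single TCP byte stream cannot avoid.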

One of our group members introduced a new approach to concurrent multipath communication over the standard SCTP protocol, called CMC-SCTP, as an alternative to CMT [12]. CMC-SCTP [13] provides fast, efficient end-to-end use of all available communication paths simultaneously.

This method uses the fastest path (i.e., the one with minimum round-trip time) for exchanging control and coordination data between the communicating nodes. This minimizes the communication reaction time, and thus the communication delay, while the achievable bandwidth grows up to the sum of the bandwidths of all available communication interfaces.
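The two ideas in this paragraph, minimum-RTT path selection for control traffic and striping of bulk data across all paths, can be sketched as follows. The dictionary-based path descriptors and the round-robin striping policy are simplifications for illustration, not the CMC-SCTP scheduling algorithm itself:

```python
def pick_control_path(paths):
    """Control/coordination traffic goes over the minimum-RTT path."""
    return min(paths, key=lambda p: p["rtt_ms"])

def stripe(data: bytes, paths, chunk: int = 4):
    """Round-robin data chunks across all paths so that the aggregate
    bandwidth approaches the sum of the individual links."""
    plan = []
    for i in range(0, len(data), chunk):
        path = paths[(i // chunk) % len(paths)]
        plan.append((path["name"], data[i:i + chunk]))
    return plan

paths = [{"name": "eth0", "rtt_ms": 40, "mbps": 100},
         {"name": "eth1", "rtt_ms": 15, "mbps": 50}]
assert pick_control_path(paths)["name"] == "eth1"        # fastest path
assert sum(p["mbps"] for p in paths) == 150              # bandwidth bound
```

A real scheduler would weight chunks by per-path bandwidth and react to RTT changes; the sketch only shows why control latency and aggregate throughput can be optimized independently.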

II. OBJECTIVES AND WORK SCHEDULE

The main objective of the proposed project is to improve the performance and scalability of communication in wide area networks for HPC applications. This is to be achieved by providing a wide area MPI that is based on our CMC-SCTP extension protocol. I chose Open MPI because its modular structure supports easy integration of new modules and allows us to use arbitrary MPI parallel programs as case studies. I will develop a real-world environment as well as a simulation environment for evaluation. CMC-SCTP will be integrated into the modular structure of Open MPI in order to provide an infrastructure for in-depth evaluation using real-world programs. We call this project CSM (Concurrent Multipath Communication SCTP for Open MPI). Open MPI already supports multipath communication through message striping across multiple NICs; in addition, as noted above, SCTP associations and streams closely match the message-ordering semantics of MPI, which makes the integration of the Open MPI middleware with the SCTP protocol easier. To achieve this goal, two test beds will be developed that allow us to evaluate our concepts in detail.

In a first step, I will develop a real-world test bed by integrating our new protocols into the Open MPI framework. In a second step, I will build a simulation-based test bed that allows us to investigate scenarios that cannot be set up in reality.

As shown in Figure 1, in the first step the new SCTP module will be integrated as just another transport layer protocol into the PTL framework of Open MPI. Before this can be done, however, some interfaces inside the Open MPI architecture need to be modified. CMC (Concurrent Multipath Communication) features will then be added to the SCTP-based Open MPI middleware (PML modules will be implemented according to the CSM functionality) in order to provide concurrent multipath communication facilities. Our new modules are indicated in Figure 1 as blue rectangles with bold italic text. The result is an Open MPI version that can use CMC-SCTP as its transport layer protocol. Later on, the performance of CSM will be evaluated with parallel benchmark suites. To assess the benefits of CMC-SCTP's multi-homing feature, MPI bandwidth benchmark programs will be run, and the performance metrics will be recorded for execution with CMC-SCTP-based Open MPI and with TCP-based Open MPI. We will then be able to use arbitrary MPI programs to evaluate the effect of our new protocol on performance.
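The bandwidth comparison planned here could be structured along the lines of the following micro-benchmark harness; `send_fn` is a hypothetical stand-in for one MPI message transfer, since the actual evaluation would use established MPI benchmark suites:

```python
import time

def measure_bandwidth(send_fn, msg_size=1 << 16, iters=50):
    """Time `iters` transfers of a fixed-size message through `send_fn`
    and report the achieved rate in MB/s. The same harness can then be
    run once per transport (e.g., TCP-based vs. CMC-SCTP-based)."""
    payload = b"x" * msg_size
    t0 = time.perf_counter()
    for _ in range(iters):
        send_fn(payload)
    elapsed = time.perf_counter() - t0
    return (msg_size * iters) / elapsed / 1e6  # MB/s

# Demo with an in-memory sink standing in for the network transfer:
sink = []
rate = measure_bandwidth(sink.append)
assert rate > 0 and len(sink) == 50
```

Recording the same metric for both transports under identical message sizes and iteration counts is what makes the TCP vs. CMC-SCTP comparison meaningful.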

In a second step, the real-world environment will be complemented by a simulation-based environment that allows us to evaluate aspects that cannot be addressed appropriately in a real-world setting. We can investigate how parallel programs behave under network characteristics that are outside the scope of today's real-world network fabric. Simulation can also predict the performance improvement that can be obtained by adding network interfaces to existing compute nodes in a cluster.
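Such a prediction could start from a model as simple as the following sketch, in which aggregate NIC bandwidth is capped by the WAN bottleneck; this deliberately ignores protocol overhead, congestion, and scheduling losses, which the simulation environment would model in detail:

```python
def predicted_bandwidth(nic_mbps, wan_cap_mbps):
    """Toy model: aggregate NIC bandwidth, capped by the WAN bottleneck."""
    return min(sum(nic_mbps), wan_cap_mbps)

# Adding a second 100 Mb/s NIC doubles throughput while the WAN link
# still has headroom; beyond saturation, extra NICs stop helping.
assert predicted_bandwidth([100], 1000) == 100
assert predicted_bandwidth([100, 100], 1000) == 200
assert predicted_bandwidth([100] * 20, 1000) == 1000  # WAN-limited
```

Even this crude model captures the key question the simulation must answer: at what NIC count does the WAN link, rather than the hosts, become the bottleneck.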

III. CONCLUSION

The implementation of SCTP-based middleware in Open MPI that supports both the multi-homing and multi-streaming features is a strong motivation for continued research on IP-based transport protocol support for MPI. In this project, I propose the use of Concurrent Multipath Communication for SCTP as a robust transport protocol extension for transferring data simultaneously over multiple paths, in order to improve bandwidth, fault tolerance, and scalability, and especially to reduce the overall communication delay of MPI in wide area networks.

Parallel processing techniques are algorithmic and code-structuring methods that enable the parallelization of program functions. The coming availability of large-scale grid systems through the adoption of cloud architectures creates an opportunity to apply these techniques to application system design. Since cloud computing shares characteristics with grid computing, it can accommodate High Performance Computing and is emerging as a new business model on a large scale. Many companies use cloud-based servers and software, i.e., remote networks of computing resources that are often shared by multiple customers. This strategy enables enterprises to cut operating costs by improving data flow and by replacing hardware in the company's internal data center with cloud computing systems, which reduces the energy needed for running and cooling hardware. It also enables company IT staff to focus on projects that are strategic to their business instead of managing each computer or cluster individually. Cloud computing architectures highlight the need for systems to operate in a highly dynamic grid environment, forcing parallel processing techniques into mainstream programs. Therefore, I hope my proposed project can provide a motivation for future research in the field of cloud computing, towards a single, comprehensive product that addresses pain points such as scalability, operating cost, and communication delay in organizations of all sizes.

CSM will be a point-to-point concurrent multipath communication module in Open MPI that supports multi-homing and the ability to stripe and share transferred data across multiple available interfaces.

Finally, I hope my work will be incorporated into the Open MPI project as an efficient parallelism approach for applications that leverage the massive amounts of data available from the Web, social networks, and large-scale systems.

REFERENCES

[1] Message Passing Interface Forum. MPI: A Message Passing Interface. In Supercomputing '93, pages 878-883. IEEE Computer Society Press, 1993.

[2] Al Geist, William Gropp, and Marc Snir. MPI-2: Extending the message-passing interface. In Euro-Par, Vol. I, pages 128-135, 1996.

[3] The LAM/MPI Team Open Systems Lab. LAM/MPI user’s guide, Pervasive Technology Labs, Indiana University, February 2007.

[4] http://www.mcs.anl.gov/research/projects/mpich2, accessed on 25/04/2012.

[5] Nicholas T. Karonis, Brian R. Toonen, and Ian T. Foster. MPICH-G2: A grid-enabled implementation of the message passing interface. CoRR, cs.DC/0206040, 2002.

[6] Rainer Keller and Edgar Gabriel. Towards efficient execution of MPI applications on the grid: Porting and optimization issues. Journal of Grid Computing, 1:133-149, June 2003.

[7] Graham E. Fagg and Jack J. Dongarra. Building and using a fault-tolerant MPI implementation. International Journal of High Performance Computing Applications, 18(3):353-361, 2004.

[8] Supratik Majumder, Scott Rixner, Comparing Ethernet and Myrinet for MPI Communication, Los Alamos National Laboratory, 2006.

[9] http://www.open-mpi.org, accessed on 25/04/2012

[10] T.S. Woodall, R.L. Graham, and A. Lumsdaine. TEG: A high-performance, scalable, multi-network point-to-point communications methodology, Hungary, September 2004.

[11] Randall R. Stewart, Stream control transmission protocol (SCTP): RFC 4960, 2007.

[12] M. Becke, M. Tuexen, Ed., Load Sharing for the Stream Control Transmission Protocol (SCTP), Network Working Group, 2011.

[13] Abbas Malekpour and D. Tavangarian. Concurrent Multipath Communication for SCTP: A novel method for data transmission. In I2TS 2010, Rio de Janeiro, Brazil, December 13-15, 2010.
