High Performance Messaging on Workstations:
Illinois Fast Messages (FM) for Myrinet
Scott Pakin† Mario Lauria‡ Andrew Chien†
In most computer systems, software overhead dominates the cost of messaging, reducing delivered performance, especially for short messages. Efficient software messaging layers are needed to deliver the hardware performance to the application level and to support tightly-coupled workstation clusters.
Illinois Fast Messages (FM) 1.0 is a high speed messaging layer that delivers low latency and high bandwidth for short messages. For 128-byte packets, FM achieves bandwidths of 16.2 MB/s and one-way latencies of 32 μs on Myrinet-connected SPARCstations (user-level to user-level). For shorter packets, we have measured one-way latencies of 25 μs, and for larger packets, bandwidth as high as 19.6 MB/s, a delivered bandwidth greater than OC-3. FM is also superior to the Myrinet API messaging layer, not just in terms of latency and usable bandwidth, but also in terms of the message half-power point ($n_{1/2}$), which is two orders of magnitude smaller (54 vs. 4,409 bytes).
We describe the FM messaging primitives and the critical design issues in building low-latency messaging layers for workstation clusters. Several issues are critical: the division of labor between host and network coprocessor, management of the input/output (I/O) bus, and buffer management. To achieve high performance, messaging layers should assign as much functionality as possible to the host. If the network interface has DMA capability, the I/O bus should be used asymmetrically, with
The research described in this paper was supported in part by NSF grants CCR-9209336 and MIP-92-23732, ONR grants N00014-92-J-1961 and N00014-93-1-1086, and NASA grant NAG 1-613. Andrew Chien is supported in part by NSF Young Investigator Award CCR-94-57809.
†Department of Computer Science, University of Illinois at Urbana-Champaign, 1304 W. Springfield Ave., Urbana, IL 61801, USA
‡Dipartimento di Informatica e Sistemistica, Università di Napoli Federico II, via Claudio 21, 80125 Napoli, Italy
the host processor moving data to the network and exploiting DMA to move data to the host. Finally, buffer management should be extremely simple in the network coprocessor and match queue structures between the network coprocessor and host memory. Detailed measurements show how each of these features contributes to high performance.
1 Introduction
As the performance of workstations reaches hundreds of megaflops (even gigaflops), networks of workstations provide an increasingly attractive vehicle for high performance computation. In fact, workstation clusters have a number of advantages over their major competitors (massively-parallel processors based on workstation processors). These advantages can include lower cost, a larger software base, and greater accessibility. Further, the advent of high performance network interconnects such as ATM, Fibre Channel, FDDI, and Myrinet presents the possibility that workstation clusters can deliver good performance on a broader range of parallel computations.
Achieving efficient communication is the major challenge in synthesizing effective parallel machines from networks of workstations. Unfortunately, to date the most common messaging layers used for clusters (TCP/IP, PVM) generally have not delivered a large fraction of the underlying communication hardware performance to the applications. Reasons for this include protocol overhead, buffer management, link management, and operating system overhead. Even in recent high speed network experiments, high bandwidths are generally only achieved for large messages (hundreds of kilobytes or even megabytes) and then only with overheads of 1 millisecond or more. Contributing factors include system call overhead, buffer copying, network admission control, poor network management, and software overhead. As a result, parallel computing on workstation clusters has largely been limited to coarse-grained applications.
Attempts to improve performance based on specialized hardware can achieve dramatically higher performance, but generally require specialized components and interfacing deep into a computer system design [16, 18, 19]. This increases cost and decreases the potential market (and hence sales volume) of the network hardware.
The goal of the Illinois Fast Messages (FM) project is to deliver a large fraction of the network's physical performance (latency and bandwidth) to the user at small packet sizes.1 Building efficient software messaging layers is not a unique goal [12, 25, 30, 31], but FM is distinguished by its hardware context (Myrinet) and high performance.
1 More information and software releases of FM are available from: http://www-csag.cs.uiuc.edu/projects/communication/sw-messaging.html
The Fast Messages project focuses on optimizing the software messaging layer that resides between lower-level communication services and the hardware. It is available on both the Cray T3D [22, 23] and Myricom's Myrinet. Using the Myrinet, FM provides MPP-like communication performance on workstation clusters. FM on the Myrinet achieves low-latency, high-bandwidth messaging for short messages, delivering 32 μs latency and 16 MBytes/s bandwidth for 128-byte packets (user-level to user-level). For shorter packets, latency drops to 25 μs, and for larger packets, bandwidth rises to 19.6 MB/s. This delivered bandwidth is greater than OC-3 ATM's physical link bandwidth of 19.4 MB/s. FM's performance exceeds the messaging performance of commercial messaging layers on numerous massively-parallel machines [21, 29, 11]. A good characterization of a messaging layer's usable bandwidth (bandwidth for short messages) is $n_{1/2}$, the packet size to achieve half of the peak bandwidth ($r_\infty/2$). FM achieves an $n_{1/2}$ of 54 bytes. In comparison, Myricom's commercial API requires messages of over 3,873 bytes to achieve the same bandwidth. FM has improved the network's ability to deliver performance to short messages dramatically, reducing $n_{1/2}$ by two orders of magnitude.
In the design of FM, we addressed three critical design issues
faced by all designers of input/output bus interfaced high speed networks: division of labor between host and network coprocessor, management of the input/output bus, and buffer management. To achieve high performance, messaging layers should assign as much functionality as possible to the host. This leaves the network coprocessor free to service the high speed network channels. If the network interface has DMA capability, the input/output bus should be used asymmetrically, with the host processor moving data to the network and exploiting DMA to move data to the host. Using the processor to move data to the network reduces latency, particularly for small messages. DMA transfer for incoming messages, initiated by the network coprocessor, maximizes receive bandwidth with little cost in latency. Finally, buffer management should be extremely simple in the network coprocessor and match queue structures between the network coprocessor and host memory. Simple buffer management minimizes software overhead in the network coprocessor, again freeing the coprocessor to service the fast network. Matching queue structures between the host and network coprocessor allows short messages to be aggregated in DMA operations, reducing the data movement overhead. Detailed measurements evaluate several design alternatives and show how each of these achieves high performance.
The rest of the paper is organized as follows. Section 2 describes
issues common to all messaging layers. Section 3 explains our FM design in light of the hardware constraints of a workstation cluster. In Section 4, we present the design and performance of FM elements in detail, justifying each design decision with empirical studies. We discuss our findings in Section 5 and provide a brief summary of the paper and conclusions in Section 6.
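The half-power point used in the introduction to characterize usable bandwidth (the packet size at which half of the peak bandwidth is reached) can be made concrete with a simple linear cost model. The sketch below is an illustration only. It assumes that transferring n bytes takes a fixed startup time t0 plus n divided by the peak bandwidth r_inf, which is a common textbook model and not necessarily the model behind the measurements in this paper. The function names and the numbers in the note that follows are hypothetical.

```c
#include <assert.h>

/*
 * Assumed linear cost model (an illustration, not taken from the paper):
 *     T(n) = t0 + n / r_inf
 * t0 is the fixed per-message startup time in microseconds and r_inf is the
 * peak bandwidth in MB/s (1 MB = 10^6 bytes, so 1 MB/s equals 1 byte per
 * microsecond).
 */

/* Time in microseconds to transfer n_bytes under the assumed model. */
static double transfer_time_us(double t0_us, double r_inf_mbs, double n_bytes)
{
    assert(r_inf_mbs > 0.0);
    return t0_us + n_bytes / r_inf_mbs;
}

/* Delivered bandwidth in MB/s for messages of n_bytes. */
static double bandwidth_mbs(double t0_us, double r_inf_mbs, double n_bytes)
{
    return n_bytes / transfer_time_us(t0_us, r_inf_mbs, n_bytes);
}

/* Half-power point: solving bandwidth(n) = r_inf / 2 for n gives t0 * r_inf. */
static double half_power_point_bytes(double t0_us, double r_inf_mbs)
{
    return t0_us * r_inf_mbs;
}
```

With the illustrative values t0 = 3 μs and r_inf = 20 MB/s, half_power_point_bytes(3.0, 20.0) gives 60 bytes, and bandwidth_mbs(3.0, 20.0, 60.0) gives 10 MB/s, half of the assumed peak. Under this model, lowering the half-power point at a fixed peak bandwidth amounts to lowering the per-message startup cost.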
2 Background
For some time, researchers and even production sites have been using workstation clusters for parallel computation. Many libraries are available to support such distributed parallel computing (PVM atop UDP or TCP is perhaps the most popular). The communication primitives in these libraries have typically exploited operating system communication services, running atop 10 Mb/s Ethernet, or more recently some higher speed physical media such as FDDI, ATM or Fibre Channel. While such facilities are useful for coarse-grain decoupled parallelism, they suffer from high software communication overhead (operating system calls) and low achieved bandwidth (media limits or software overhead), and thus cannot support more tightly coupled or finer-grained parallelism.
Higher performance messaging systems for workstation clusters often bypass the operating system, mapping the network device interface directly into the user address space and accessing it directly via load/store operations. Protection can still be achieved at virtual address translation, but sharing of communication resources is more complicated. Our FM layer uses this approach, mapping the Myrinet network interface directly into the user address space. Note that even with a memory-mapped interface, accesses can still be expensive; in our Myrinet system, reading a network interface status field requires 15 processor cycles. Some ATM systems provide memory-mapped input/output bus interfaces, but achieving performance is still a challenging proposition. For example, delivered bandwidths of 13 MB/s are typical. Achieving high performance requires careful management of the hardware resources by the software messaging layer.
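The load/store access pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions: the register layout, the NIC_STATUS_RX_READY bit, and the function names are hypothetical and are not the Myrinet or FM interface. On a real system the pointer would refer to device memory mapped into the process; here any nic_regs_t object, such as an ordinary variable, can stand in for it. The 15-cycle constant is the cost of one status read quoted in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block of a memory-mapped network interface. */
typedef struct {
    volatile uint32_t status;   /* assumed: bit 0 set when a packet is waiting */
    volatile uint32_t recv_len; /* assumed: length of the waiting packet */
} nic_regs_t;

#define NIC_STATUS_RX_READY 0x1u  /* hypothetical status bit */
#define STATUS_READ_CYCLES  15    /* per-read cost quoted in the text */

/*
 * Poll the status field with plain loads, up to max_polls times.
 * Returns the number of status reads performed when a packet is found,
 * or -1 if none was seen within max_polls reads.
 */
static int nic_poll_rx(const nic_regs_t *nic, int max_polls)
{
    assert(nic != NULL);
    for (int i = 1; i <= max_polls; i++) {
        if (nic->status & NIC_STATUS_RX_READY)
            return i;
    }
    return -1;
}

/* Processor cycles spent on status reads alone for a given number of polls. */
static long poll_cost_cycles(int polls)
{
    return (long)polls * STATUS_READ_CYCLES;
}
```

Because every poll is a load that crosses the I/O bus, a loop that checks the status field four times without finding a packet spends about 60 cycles on status reads alone under the quoted figure. This is one reason a messaging layer has to budget its accesses to the interface carefully.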
How deeply network interfaces will be integrated into a typicalsystem is a debate currently raging in the works