
Page 1: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Cracow, 16 October 2006

Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Paweł Pisarczyk [email protected]

Jarosław Węgliński [email protected]

Page 2: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Agenda

Introduction

HPC cluster interconnects

Message propagation model

Experimental setup

Results

Conclusions

Page 3: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Who we are

joint stock company

founded in 1994, earlier (since 1991) operating as a department within PP ATM

IPO in September 2004 (Warsaw Stock Exchange)

majority of shares owned by the founders (Polish citizens)

no state capital involved

financial data

share capital of about €6 million

2005 sales of €29.7 million

about 230 employees

Page 4: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Mission

building business value through innovative information & communication technology initiatives that create new markets in Poland and abroad

ATM's competitive advantage is based on combining three key competences:

integration of comprehensive IT systems

telecommunication services

consulting and software development

Page 5: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Achievements

1991 Poland’s first company connected to the Internet

1993 Poland’s first commercial ISP

1994 Poland’s first LAN with ATM backbone

1994 Poland’s first supercomputer on Jack Dongarra’s TOP500 list

1995 Poland’s first MAN in ATM technology

1996 Poland’s first corporate network with voice & data integration

2000 Poland’s first prototype Interactive TV system over a public network

2002 Poland’s first validated MES system for a pharmaceutical factory

2003 Poland’s first commercial, public Wireless LAN

2004 Poland’s first public IP content billing system

Page 6: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Client base

telecommunications: 40.4%

finance: 17.7%

academia: 9.6%

media: 5.9%

utilities: 5.7%

manufacturing: 4.9%

public sector: 4.4%

other: 11.4%

(based on 2005 sales revenues)

Page 7: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

HPC clusters developed by ATM

2004 - Poznan Supercomputing and Networking Center: 238 Itanium2 CPUs, 119 x HP rx2600 nodes with Gigabit Ethernet interconnect

2005 - University of Podlasie: 34 Itanium2 CPUs, 17 x HP rx2600 nodes with Gigabit Ethernet interconnect and Lustre 1.2 filesystem

2005 - Poznan Supercomputing and Networking Center: 86 dual-core Opteron CPUs, 42 x Sun SunFire v20z and 1 x Sun SunFire v40z nodes with Gigabit Ethernet interconnect

2006 - Military University of Technology, Faculty of Engineering, Chemistry and Applied Physics: 32 Itanium2 CPUs, 16 x HP rx1620 nodes with Gigabit Ethernet interconnect

2006 - Gdansk University of Technology, Department of Pharmaceutical Technology and Chemistry: 22 Itanium2 CPUs (11 x HP rx1620 nodes) with Gigabit Ethernet interconnect

Page 8: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Selected software projects related to distributed systems

Distributed Multimedia Archive in Interactive Television (iTVP) Project

scalable storage for the iTVP platform with the ability to process the stored content

ATM Objects: scalable storage for a multimedia content distribution platform

system built for the Cinem@n company (founded by ATM and Monolith)

Cinem@n will introduce digital distribution services for high-quality movies, news and entertainment content

Spread Screens Manager: platform for POS TV

the system is currently used by Zabka (retail chain) and Neckermann (travel service)

about 300 terminals presenting multimedia content, located in many Polish cities

Page 9: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Selected current projects

ATMFS: distributed filesystem for petabyte-scale storage based on COTS

based on variable-sized chunks

advanced replication and enhanced error detection

dependability evaluation based on software fault injection technique

FastGig: RDMA stack for a Gigabit Ethernet-based interconnect

message-passing latency reduction

increases application performance

Page 10: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Uses of computer networks in HPC clusters

Exchange of messages between cluster nodes to coordinate distributed computation

requires high peak throughput as well as low latency

inefficiency is observed when the time consumed by a single computation step is comparable to the message-passing time

Access to shared data through a network or cluster file system

requires high bandwidth when transferring data in blocks of a defined size

filesystem and storage drivers try to reduce the number of I/O operations issued (by buffering data and aggregating transfers)

Page 11: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Comparison of characteristics of interconnect technologies

* Brett M. Bode, Jason J. Hill, and Troy R. Benjegerdes, “Cluster Interconnect Overview”, Scalable Computing Laboratory, Ames Laboratory

Page 12: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Gigabit Ethernet interconnect characteristic

Popular technology for low cost cluster interconnects

Satisfactory throughput for long frames (1000 bytes and longer)

High latency and low throughput for small frames

These drawbacks are mostly caused by the design of existing network interfaces

What is the influence of the network stack implementation on communication latency?

Page 13: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Message propagation model

Latency between transferring a message to/from the MPI library and transferring data to/from the network stack

Time difference between the sendto()/recvfrom() functions and the driver start_xmit()/interrupt functions

Execution time of driver functions

Processing time of the network interface

Propagation latency and latency introduced by active network elements
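
Taken together, these stages add up to the one-way message propagation time. The summary below is a minimal sketch added for clarity; the generic index i is an assumption of this write-up (the original slides evidently number the stages, since the conclusions refer to T3 and T5, but the exact mapping is not visible in this transcript).

```latex
% One-way message propagation time modelled as the sum of the stage
% latencies listed above (MPI <-> stack, stack <-> driver, driver
% execution, NIC processing, wire and active network elements).
% The numbering is illustrative, not taken from the original slides.
T_{\mathrm{msg}} \;=\; \sum_{i=1}^{5} T_i
```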

Page 14: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Experimental setup

Two HP rx2600 servers, each with 2 x Intel Itanium2 1.3 GHz CPUs (3 MB cache)

Debian GNU/Linux Sarge 3.1 operating system (kernel 2.6.8-2-mckinley-smp)

Gigabit Ethernet interfaces: Broadcom BCM5701 chipset connected via the PCI-X bus

To eliminate the possibility of additional delays introduced by external active network devices, the servers were connected using crossover cables

Two NIC drivers were tested: tg3 (polling NAPI driver) and bcm5700 (interrupt-driven driver)

Page 15: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Tools used for measurements

NetPipe package for measuring throughput and latency for TCP and several implementations of MPI

For low-level testing, test programs working directly on Ethernet frames were developed

Test programs and NIC drivers were modified to allow measurement, insertion and transfer of timestamps (a hypothetical sketch of such a low-level test is shown below)
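
The actual test programs are not included in this transcript. The following is a minimal, hypothetical sketch of the kind of raw-Ethernet ping-pong latency probe described above, written against Linux AF_PACKET sockets; the interface name, peer MAC address, EtherType and the assumption that the peer echoes the frame back are all illustrative, not taken from the FastGig code.

```c
/* Hypothetical raw-Ethernet ping-pong latency probe (not the original
 * FastGig test code).  Sends a minimal Ethernet frame to a peer and
 * measures the time until the echoed frame comes back.  Requires root. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <linux/if_packet.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

#define ETH_TYPE_TEST 0x88B5   /* IEEE "local experimental" EtherType */

int main(void)
{
    const char *ifname = "eth0";                              /* assumed NIC name */
    unsigned char peer[6] = {0x00,0x11,0x22,0x33,0x44,0x55};  /* assumed peer MAC */

    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_TYPE_TEST));
    if (fd < 0) { perror("socket"); return 1; }

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0) { perror("SIOCGIFINDEX"); return 1; }
    int ifindex = ifr.ifr_ifindex;
    if (ioctl(fd, SIOCGIFHWADDR, &ifr) < 0) { perror("SIOCGIFHWADDR"); return 1; }

    unsigned char frame[64];                        /* minimum-size frame */
    memset(frame, 0, sizeof(frame));
    memcpy(frame, peer, 6);                         /* destination MAC */
    memcpy(frame + 6, ifr.ifr_hwaddr.sa_data, 6);   /* source MAC */
    frame[12] = ETH_TYPE_TEST >> 8;                 /* EtherType, big-endian */
    frame[13] = ETH_TYPE_TEST & 0xff;

    struct sockaddr_ll dst;
    memset(&dst, 0, sizeof(dst));
    dst.sll_family   = AF_PACKET;
    dst.sll_protocol = htons(ETH_TYPE_TEST);
    dst.sll_ifindex  = ifindex;
    dst.sll_halen    = 6;
    memcpy(dst.sll_addr, peer, 6);

    struct timespec t0, t1;
    unsigned char rx[1600];

    clock_gettime(CLOCK_MONOTONIC, &t0);
    sendto(fd, frame, sizeof(frame), 0, (struct sockaddr *)&dst, sizeof(dst));
    recvfrom(fd, rx, sizeof(rx), 0, NULL, NULL);    /* peer is assumed to echo */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("round-trip: %.2f us, one-way estimate: %.2f us\n", us, us / 2.0);

    close(fd);
    return 0;
}
```

A matching echo program on the second node (receive, swap MAC addresses, send back) would complete the ping-pong pair; repeating the exchange many times and taking the minimum is the usual way to filter out scheduling noise.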

Page 16: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Throughput characteristic for tg3 driver

Page 17: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Latency characteristic for tg3 driver

Page 18: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Results for tg3 driver

The overhead introduced by MPI library is relatively low

There is a big difference between transmission latencies in the ping-pong and streaming modes

The latency observed for small frames is similar to the latency of a 115 kbps UART transmitting a single byte (at 115 kbps, one byte framed with start and stop bits takes roughly 87 μs on the wire)

We can deduce that there is some mechanism in the transmission path that delays transmission of single packets

What is the difference between NAPI and interrupt driven driver?

Page 19: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Interrupt driven driver vs NAPI driver (throughput characteristic)

Page 20: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Interrupt driven driver vs NAPI driver (latency characteristic)

Page 21: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Interrupt driven driver vs NAPI driver (latency characteristic) - details

Page 22: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Comparison of bcm5700 and tg3 drivers

With the default configuration, the bcm5700 driver has worse characteristics than tg3

The interrupt-driven version (default configuration) cannot achieve more than 650 Mb/s of throughput for frames of any size

After disabling interrupt coalescing, the performance of the bcm5700 driver exceeded the results obtained with the tg3 driver (a sketch of how coalescing can be disabled is shown below)

Disabling polling can improve the characteristics of the network driver, but NAPI is not the major cause of the transmission delay
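
The slides do not show how coalescing was disabled for the bcm5700 driver (historically this was often done via module parameters). Purely as an illustration, the sketch below uses the generic ethtool ioctl interface (ETHTOOL_GCOALESCE/ETHTOOL_SCOALESCE), which many NIC drivers support; the interface name is an assumption.

```c
/* Illustrative only: turn off interrupt coalescing on a NIC via the
 * generic ethtool ioctl.  The 2006-era bcm5700 driver may instead have
 * required module parameters; check the driver documentation. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
    const char *ifname = "eth0";              /* assumed interface name */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);  /* any socket works for SIOCETHTOOL */
    if (fd < 0) { perror("socket"); return 1; }

    struct ethtool_coalesce ec;
    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&ec;

    memset(&ec, 0, sizeof(ec));
    ec.cmd = ETHTOOL_GCOALESCE;               /* read current settings first */
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("GCOALESCE"); return 1; }

    ec.cmd = ETHTOOL_SCOALESCE;               /* write back with coalescing off */
    ec.rx_coalesce_usecs = 0;                 /* interrupt immediately ...      */
    ec.rx_max_coalesced_frames = 1;           /* ... after every received frame */
    ec.tx_coalesce_usecs = 0;
    ec.tx_max_coalesced_frames = 1;
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("SCOALESCE"); return 1; }

    close(fd);
    return 0;
}
```

On systems with the ethtool utility, the same effect can usually be achieved with: ethtool -C eth0 rx-usecs 0 rx-frames 1 tx-usecs 0 tx-frames 1 (provided the driver exposes these parameters).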

Page 23: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Tools for message processing time measurement

Timestamps were inserted into the message at each processing stage

Processing stages on the transmitter side: sendto() function

bcm5700_start_xmit()

interrupt notifying frame transmission

Processing stages on the receiver side: interrupt notifying frame receipt

netif_rx()

recvfrom() function

The CPU clock cycle counter was used as a high-precision timer (resolution of 0.77 ns = 1/1.3 GHz); a sketch of such a timestamp helper is shown below
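
The original instrumentation is not shown in the transcript. The following is a minimal, hypothetical timestamp helper that reads the cycle counter from user space, using the Itanium interval time counter (ar.itc) for the ia64 machines described above, with an x86 TSC variant for comparison.

```c
/* Minimal sketch of a cycle-counter timestamp helper (not the original
 * instrumentation).  On Itanium the ar.itc register ticks at the CPU
 * frequency, so at 1.3 GHz one tick corresponds to ~0.77 ns. */
#include <stdint.h>

static inline uint64_t read_cycles(void)
{
#if defined(__ia64__)
    uint64_t t;
    __asm__ __volatile__("mov %0=ar.itc" : "=r"(t));   /* interval time counter */
    return t;
#elif defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi)); /* time-stamp counter */
    return ((uint64_t)hi << 32) | lo;
#else
#   error "cycle counter read not implemented for this architecture"
#endif
}

/* Example: convert a measured tick delta to microseconds,
 * assuming the 1.3 GHz clock of the test machines. */
static inline double cycles_to_us(uint64_t delta)
{
    return (double)delta / 1300.0;   /* 1300 cycles per microsecond */
}
```

Inside the kernel (e.g. in the modified NIC drivers), the get_cycles() helper provides the same counter value.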

Page 24: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Transmitter latency in streaming mode

[Timing diagram: Send and Answer phases, with measured intervals of approximately 17 μs, 17 μs and 2 μs]

Page 25: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Distribution of delays in transmission path between cluster nodes

Page 26: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters

Conclusions

We estimate that RDMA-based communication can reduce the MPI message propagation time from 43 μs to 23 μs (roughly doubling the performance for short messages)

There is also a possibility of reducing the T3 and T5 latencies by changing the configuration of the network interface (transmit and receive thresholds)

The conducted research did not consider differences between network interfaces (the T3 and T5 delays may be longer or shorter than measured)

The latency introduced by a switch was also omitted

The FastGig project includes not only a communication library, but also a measurement and communication-profiling framework