Vijay Lakamraju, Israel Koren, C. Mani Krishna
Low Overhead Fault Tolerant Networking (in Myrinet)
Architecture and Real-Time Systems (ARTS) Lab.
Department of Electrical and Computer Engineering
University of Massachusetts, Amherst, MA 01003
Motivation
The increasing use of COTS components in systems has been motivated by the need to:
  Reduce design and maintenance costs
  Reduce software complexity
The emergence of low-cost, high-performance COTS networking solutions, e.g., Myrinet, SCI, Fibre Channel
The increasing complexity of network interfaces has renewed concerns about their reliability: the amount of silicon used has increased tremendously
The Basic Question
How can we incorporate fault tolerance into a COTS network technology without greatly compromising its performance?
Microprocessor-based Networks
Most modern network technologies have processors in their interface cards that help achieve superior network performance
Many of these technologies allow changes to the program running on the network processor
Such programmable interfaces offer numerous benefits:
  Developing different fault tolerance techniques
  Validating fault recovery using fault injection
  Experimenting with different communication protocols
We use Myrinet as the platform for our study
Myrinet
Myrinet is a cost-effective, high-performance (2.2 Gb/s) packet-switching technology
At its core is a powerful RISC processor
It is scalable to thousands of nodes
Low-latency communication (8 µs) is achieved through direct interaction with the network interface ("OS bypass")
Flow control, error control, and simple "heartbeat" mechanisms are incorporated in hardware
Link and routing specifications are public and standard
Myrinet support software is supplied as open source
Myrinet Configuration
[Block diagram: a host node (Host Processor, System Bridge, System Memory) connects over the I/O bus to the LANai 9 card, which contains the PCI Bridge, PCI DMA, DMA Engine, Host Interface, RISC core, LANai SRAM, Timers, Packet Interface, and SAN/LAN Conversion.]
Hardware & Software
[Block diagram: the programmable interface. Hardware: Host Processor and System Memory connect over the I/O bus to the Myrinet card, which holds the Network Processor and Local Memory. Software: the Application sits on middleware (e.g., MPI) and the TCP/IP interface, which use the OS driver to reach the Myrinet Control Program running on the card.]
Susceptibility to Failures
Dependability evaluation was carried out using software-implemented fault injection
Faults were injected into the Myrinet Control Program (MCP)
A wide range of failures was observed:
  Unexpected latencies and reduced bandwidth
  The network processor can hang and stop responding
  The host system can crash/hang
  A remote network interface can be affected
Similar types of failures can be expected from other high-speed networks
Such failures can greatly impact the reliability/availability of the system
Summary of Experiments
More than 50% of the failures were host interface hangs
Failure Category              Count   % of Injections
Host Interface Hang             514   24.6
Messages Dropped/Corrupted      264   12.7
MCP Restart                      65    3.1
Host Computer Crash               9    0.43
Other Errors                     23    1.1
No Impact                      1205   57.9
Total                          2080   100
Design Considerations
The faults must be detected and diagnosed as quickly as possible
The network interface must be up and running as soon as possible
The recovery process must ensure that no messages are lost or improperly received/sent: complete correctness should be achieved
The overhead on the normal running of the system must be minimal
The fault tolerance should be made as transparent to the user as possible
Fault Detection
Continuously polling the card can be very costly
We use a spare interval timer to implement a watchdog timer functionality for fault detection
We set the LANai to raise an interrupt when the timer expires
A routine (L_timer) that the LANai is supposed to execute periodically resets this interval timer
If the interface hangs, L_timer is not executed, causing our interval timer to expire and raise a FATAL interrupt
Fault Recovery Summary
The FATAL interrupt is picked up by the fault recovery daemon on the host
The failure is verified through numerous probing messages
The control program is reloaded into the LANai SRAM
Any process that was accessing the board prior to the failure is also restored to its original state
Simply reloading the MCP will not ensure correctness
Myrinet Programming Model
Flow control is achieved through send and receive tokens
Myrinet software (GM) provides reliable in-order delivery of messages:
  A modified form of the "Go-Back-N" protocol is used
  Sequence numbers for the protocol are provided by the MCP
  One stream of sequence numbers exists per destination
Typical Control Flow
Sender:
  User process prepares message
  User process sets send token
  LANai sDMAs message
  LANai sends message
  LANai receives ACK
  LANai sends event to process
  User process handles notification event
  User process reuses buffer

Receiver:
  User process provides receive buffer
  User process sets recv token
  LANai recvs message
  LANai sends ACK
  LANai rDMAs message
  LANai sends event to process
  User process handles notification event
  User process reuses buffer
Duplicate Messages
Sender:
  User process prepares message
  User process sets send token
  LANai sDMAs message
  LANai sends message
  [LANai goes down; the ACK is lost]
  Driver reloads MCP into board
  Driver resends all unacked messages
  LANai sDMAs message
  LANai sends message

Receiver:
  User process provides receive buffer
  User process sets recv token
  LANai recvs message
  LANai sends ACK
  LANai rDMAs message
  LANai sends event to process
  User process handles notification event
  User process reuses buffer
  LANai recvs message again -- duplicate message: ERROR!

Lack of redundant state information is the cause of this problem
Lost Messages
Sender:
  User process prepares message
  User process sets send token
  LANai sDMAs message
  LANai sends message
  LANai receives ACK
  LANai sends event to process
  User process handles notification event
  User process reuses buffer

Receiver:
  User process provides receive buffer
  User process sets recv token
  LANai recvs message
  LANai sends ACK
  [LANai goes down before rDMAing the message]
  Driver reloads MCP into board
  Driver sets all recv tokens again
  LANai waits for a message that will never be resent -- message lost: ERROR!

An incorrect commit point is the cause of this problem
Fault Recovery
We need to keep a copy of the state information:
  Checkpointing can be a big overhead
  Logging critical message information is enough
GM functions are modified so that:
  A copy of the send and receive tokens is made with every send and receive call
  The host processes provide the sequence numbers, one per (destination node, local port) pair
  The copy of a send/receive token is removed when the send/receive completes successfully
The MCP is modified:
  An ACK is sent out only after a message is DMAed to host memory
Performance Impact
The scheme has been integrated successfully into GM (over one person-year for the complete implementation)
How much of the system's performance has been compromised? After all, one can't get a free lunch these days!
Performance is measured using two key parameters:
  Bandwidth obtained with large messages
  Latency of small messages
Summary of Results
Performance Metric                  GM          FTGM
Bandwidth                           92.4 MB/s   92 MB/s
Latency                             11.5 µs     13.0 µs
Host-CPU utilization for send       0.3 µs      0.55 µs
Host-CPU utilization for receive    0.75 µs     1.15 µs
LANai-CPU utilization               6.0 µs      6.8 µs

Host Platform: Pentium III with 256 MB RAM, running Red Hat Linux 7.2
Summary of Results
Fault Detection Latency = 50 ms
Fault Recovery Latency = 0.765 s
Per-Process Recovery Latency = 0.50 s
Our Contributions
We have devised smart ways to detect and recover from network interface failures
Our fault detection technique for “network processor hangs” uses software implemented watchdog timers
Fault recovery time (including reloading of the network control program) is ~2 seconds
Performance impact is under 1% for messages over 1KB
Complete user transparency was achieved