Vijay Lakamraju, Israel Koren, C. Mani Krishna
Low Overhead Fault Tolerant Networking (in Myrinet)
Architecture and Real-Time Systems (ARTS) Lab.
Department of Electrical and Computer Engineering
University of Massachusetts, Amherst, MA 01003
Motivation
The increasing use of COTS components in systems has been motivated by the need to:
  Reduce design and maintenance costs
  Reduce software complexity
The emergence of low-cost, high-performance COTS networking solutions, e.g., Myrinet, SCI, Fibre Channel
The increasing complexity of network interfaces has renewed concerns about their reliability: the amount of silicon used has increased tremendously
The Basic Question
How can we incorporate fault tolerance into a COTS network technology without greatly compromising its performance?
Microprocessor-based Networks
Most modern network technologies have processors in their interface cards that help achieve superior network performance
Many of these technologies allow changes to the program running on the network processor
Such programmable interfaces offer numerous benefits:
  Developing different fault tolerance techniques
  Validating fault recovery using fault injection
  Experimenting with different communication protocols
We use Myrinet as the platform for our study
Myrinet
Myrinet is a cost-effective, high-performance (2.2 Gb/s) packet-switching technology
At its core is a powerful RISC processor
It is scalable to thousands of nodes
Low-latency communication (8 µs) is achieved through direct interaction with the network interface ("OS bypass")
Flow control, error control, and simple "heartbeat" mechanisms are incorporated in hardware
Link and routing specifications are public and standard
Myrinet support software is supplied as open source
Myrinet Configuration
[Block diagram: a host node (Host Processor, System Bridge, System Memory) connects over the I/O bus to the LANai 9 card, which contains the PCI Bridge, PCI DMA, DMA Engine, Host Interface, RISC core, LANai SRAM, Timers, Packet Interface, and SAN/LAN Conversion.]
Hardware & Software
[Block diagram: the programmable interface. Hardware: Host Processor and System Memory connect over the I/O bus to the Myrinet card, which holds the Network Processor and Local Memory. Software: the Application sits on middleware (e.g., MPI) and the TCP/IP interface, which use the OS driver to reach the Myrinet Control Program running on the card.]
Susceptibility to Failures
Dependability evaluation was carried out using software-implemented fault injection
Faults were injected into the Myrinet Control Program (MCP)
A wide range of failures was observed:
  Unexpected latencies and reduced bandwidth
  The network processor can hang and stop responding
  The host system can crash/hang
  A remote network interface can be affected
Similar types of failures can be expected from other high-speed networks
Such failures can greatly impact the reliability/availability of the system
Summary of Experiments
More than 50% of the failures were host interface hangs
Failure Category              Count   % of Injections
Host Interface Hang             514   24.6
Messages Dropped/Corrupted      264   12.7
MCP Restart                      65    3.1
Host Computer Crash               9    0.43
Other Errors                     23    1.1
No Impact                      1205   57.9
Total                          2080   100
Design Considerations
The faults must be detected and diagnosed as quickly as possible
The network interface must be up and running as soon as possible
The recovery process must ensure that no messages are lost or improperly received/sent: complete correctness should be achieved
The overhead on the normal running of the system must be minimal
The fault tolerance should be made as transparent to the user as possible
Fault Detection
Continuously polling the card can be very costly
We use a spare interval timer to implement a watchdog timer functionality for fault detection
We set the LANai to raise an interrupt when the timer expires
A routine (L_timer) that the LANai is supposed to execute periodically resets this interval timer
If the interface hangs, L_timer is not executed, causing our interval timer to expire and raise a FATAL interrupt
Fault Recovery Summary
The FATAL interrupt is picked up by the fault recovery daemon on the host
The failure is verified through numerous probing messages
The control program is reloaded into the LANai SRAM
Any process that was accessing the board prior to the failure is also restored to its original state
Simply reloading the MCP will not ensure correctness
Myrinet Programming Model
Flow control is achieved through send and receive tokens
Myrinet software (GM) provides reliable in-order delivery of messages:
  A modified form of the "Go-Back-N" protocol is used
  Sequence numbers for the protocol are provided by the MCP
  One stream of sequence numbers exists per destination
Typical Control Flow
Sender:
  User process prepares message
  User process sets send token
  LANai sDMAs message
  LANai sends message
  LANai receives ACK
  LANai sends event to process
  User process handles notification event
  User process reuses buffer

Receiver:
  User process provides receive buffer
  User process sets recv token
  LANai recvs message
  LANai sends ACK
  LANai rDMAs message
  LANai sends event to process
  User process handles notification event
  User process reuses buffer
Duplicate Messages
Sender:
  User process prepares message
  User process sets send token
  LANai sDMAs message
  LANai sends message
  [LANai goes down; the ACK is lost]
  Driver reloads MCP into board
  Driver resends all unacked messages
  LANai sDMAs message
  LANai sends message

Receiver:
  User process provides receive buffer
  User process sets recv token
  LANai recvs message
  LANai sends ACK
  LANai rDMAs message
  LANai sends event to process
  User process handles notification event
  User process reuses buffer
  LANai recvs message again -- duplicate message: ERROR!

Lack of redundant state information is the cause of this problem
Lost Messages
Sender:
  User process prepares message
  User process sets send token
  LANai sDMAs message
  LANai sends message
  LANai receives ACK
  LANai sends event to process
  User process handles notification event
  User process reuses buffer

Receiver:
  User process provides receive buffer
  User process sets recv token
  LANai recvs message
  LANai sends ACK
  [LANai goes down before rDMAing the message]
  Driver reloads MCP into board
  Driver sets all recv tokens again
  LANai waits for a message that will never be resent -- message lost: ERROR!

An incorrect commit point is the cause of this problem
Fault Recovery
We need to keep a copy of the state information:
  Checkpointing can be a big overhead
  Logging critical message information is enough
GM functions are modified so that:
  A copy of the send and receive tokens is made with every send and receive call
  The host processes provide the sequence numbers, one per (destination node, local port) pair
  The copy of a send/receive token is removed when the send/receive completes successfully
The MCP is modified:
  An ACK is sent out only after a message is DMAed to host memory
Performance Impact
The scheme has been integrated successfully into GM (over one person-year for the complete implementation)
How much of the system's performance has been compromised? After all, one can't get a free lunch these days!
Performance is measured using two key parameters:
  Bandwidth obtained with large messages
  Latency of small messages
Summary of Results
Performance Metric                  GM          FTGM
Bandwidth                           92.4 MB/s   92 MB/s
Latency                             11.5 µs     13.0 µs
Host-CPU utilization for send       0.3 µs      0.55 µs
Host-CPU utilization for receive    0.75 µs     1.15 µs
LANai-CPU utilization               6.0 µs      6.8 µs

Host Platform: Pentium III with 256 MB RAM, running Red Hat Linux 7.2
Summary of Results
Fault Detection Latency = 50 ms
Fault Recovery Latency = 0.765 s
Per-Process Recovery Latency = 0.50 s
Our Contributions
We have devised smart ways to detect and recover from network interface failures
Our fault detection technique for “network processor hangs” uses software implemented watchdog timers
Fault recovery time (including reloading of the network control program) is ~2 seconds
Performance impact is under 1% for messages over 1KB
Complete user transparency was achieved