
CONCURRENCY PRACTICE AND EXPERIENCE, VOL. 8(6), 415-428 (JULY-AUGUST 1996)

The W-Network: A low-cost fault-tolerant multistage interconnection network for fine-grain multiprocessing KEVIN B. THEOBALD Advanced Compilers, Architectures and Parallel Systems (ACAPS) School of Computer Science McGill University, Montréal, Canada

SUMMARY Large-scale multiprocessors require an efficient interconnection network to achieve good performance. This network, like the rest of the system, should be fault-tolerant (able to continue operating even when there are hardware failures). This paper presents the W-Network, a low-cost fault-tolerant MIN which is well-suited to a large multiprocessor running fine-grain parallel programs. It tolerates all single faults without any increases in latency or decreases in bandwidth following a fault, because it behaves just like the fault-free network even when there is a single fault. It requires only one extra port per chip, which makes it practical for a VLSI implementation. In addition, extra ports can be added for replacing faulty processors with spares.

1. INTRODUCTION

For large-scale fine-grained multiprocessing, an efficient interconnection network is crucial to good performance. Fault tolerance, the ability to function correctly even after some components in the system have failed, is also important for such systems. Ideally, the interconnection network in a massively parallel machine should be able to maintain full connectivity in the event of a fault (component failure), and should be able to perform at or near its normal level of performance even after the fault has occurred. The amount of redundant hardware needed to provide this fault tolerance should be kept to a minimum.

This paper presents the design of the W-Network, a fault-tolerant multistage interconnection network (FTMIN) which is well-suited to a multiprocessor, especially one running fine-grained parallel programs with non-local communications. The W-Network tolerates all single faults, yet requires little additional hardware and suffers almost no loss in performance. In addition, one can easily add the ability to replace faulty processors with spares.

1.1. Interconnection networks for fine-grain parallelism

This paper is primarily concerned with networks for large-scale parallel machines exploiting fine-grain parallelism, such as dataflow architectures[1] or the EARTH multithreaded project[2]. Fine-grain parallelism is important to such machines, because in many applications, large grain sizes constrain parallelism[3]. As more processors are added to parallel machines, the trend has been for computation grain sizes to decrease[4]. If grain sizes are small, then the messages passed between them will be short and frequent, on average.

CCC 1040-3108/96/060415-14 © 1996 by John Wiley & Sons, Ltd.

Received November 1995


For such systems, a low-latency network is important[5]. Although there are techniques for tolerating high-latency communications, such as multithreading[2], achieving good processor utilization requires surplus parallelism, which may not always be available, especially when there are many processors. Therefore, interprocessor communication systems have been moving toward low-latency networks transmitting short messages[3,6].

In many problems, the communication patterns can be highly variable. For this reason, the network should be packet-switched, meaning that each message contains a header that directs it to its destination. Performance can be improved with enhancements such as buffers in the switches to mitigate contention[7] and wormhole routing[8] to decrease latency.

1.2. Multistage interconnection networks

Multistage interconnection networks (MINs) are attractive for large-scale multiprocessors because they offer a good balance of bandwidth, latency and cost[9].* MINs offer higher bandwidth than single buses, lower latency than meshes for non-local communications, and lower cost than crossbar switches and hypercubes. Delta networks[10] are a common class of MINs. A delta network has the following properties:

1. The fundamental building block is a switching element with b inputs and b outputs. This switch can transfer data from any input port to any output port.

2. The MIN contains n stages, connected in series, each with b^(n-1) b x b switches. There is a one-to-one connection between the outputs of the switches in one stage and the inputs of the switches in the next stage.

3. There is one unique path between a given network input and a given network output. This path goes through one switch in each stage. The path can be described by n radix-b digits, where each digit selects the output port of one of the n switches in the path. (In packet-switched networks this number is part of the message header.)

In this paper, all stages, switches within a stage, and ports are numbered left-to-right or top-to-bottom, beginning with 0. Stage 0 is connected to the input ports of the network.
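Using that numbering, the unique-path property (item 3 above) can be made concrete with a short sketch. The Python fragment below is illustrative only (it is not from the paper); it computes the n radix-b digits that steer a message to a given destination, with stage 0 taken to be the stage attached to the network inputs.

```python
def routing_digits(dest, b, n):
    """Split a destination address into n radix-b digits, one per stage.

    The digit for stage i selects the output port of the switch traversed
    in that stage; stage 0 is the stage attached to the network inputs.
    (Illustrative sketch; the paper states this property in prose.)
    """
    assert 0 <= dest < b ** n
    # Stage i consumes the digit of weight b**(n-1-i), so the most
    # significant digit is used first.
    return [(dest // b ** (n - 1 - i)) % b for i in range(n)]

# Example: a 16 x 16 delta network built from 2 x 2 switches (b = 2, n = 4).
# Destination 13 is 1101 in binary, so the message leaves through ports
# 1, 1, 0, 1 of the switches it crosses in stages 0..3.
print(routing_digits(13, 2, 4))   # [1, 1, 0, 1]
```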

There are many variations of the delta network which are topologically equivalent[11]. The baseline network[11] serves as an illustration. A b^n x b^n baseline network (see Figure 1(a)) consists of a stage of b x b switches followed by b parallel b^(n-1) x b^(n-1) networks. The b outputs of a given switch in the initial stage are connected to the b^(n-1) x b^(n-1) networks, one output per network. A 16 x 16 network is built out of 2 x 2 switches in Figure 1(b).

1.3. Fault tolerance

Fault tolerance is an important consideration in the design of parallel computer systems. A massively parallel computer is likely to have thousands of IC chips. Even if the reliability of a single chip is high, the combination of so many chips leads to a much lower aggregate level of reliability for the whole system[12].

Before designing a fault-tolerant system, it is necessary to state the criteria for successful fault tolerance. Most reliability techniques involve non-linear costs wherein the last degree of reliability costs the most. System architects must weigh the costs of additional fault tolerance against the potential costs of failures. Our designs are primarily intended for large number-crunching applications, where a failure could cause the loss of a lot of valuable computing time, but would not be catastrophic. Downtime should be avoided, but with an eye to economy; we are willing to sacrifice some reliability to reduce costs.

* This reference contains a list of systems which have been built using MINs.


Figure 1. The Baseline Network: (a) general (N x N using b x b switches); (b) 16 x 16 using 2 x 2 switches

Thus, we will limit our design requirements to single-fault tolerance. We will assume that the probability of a second failure occurring before a failed component has been replaced is very low. It is likely, though, that single failures will occur many times during a system's lifetime, and that hours or days could elapse between the time a failure occurs and the time the system can be repaired. To avoid major losses of computing time, our goals include the requirement that there is little or no drop in system performance after a single failure.

We also assume our system is not gracefully-degradable. A gracefully-degradable system can remove failed processors from the system and migrate dislocated tasks to healthy processors. However, in many applications, an automatic redistribution of tasks could lead to large imbalances in the work load, slowing down the application significantly. Therefore, in our system, we assume the number of processors does not change at runtime.

Since the MIN itself is made of failure-prone components, it should also be fault-tolerant. A fault-tolerant MIN (FTMIN) for a non-gracefully-degradable system must retain full connectivity even when there are faults; it must be able to route a message from any source to any destination. Furthermore, the FTMIN should provide its fault tolerance at low cost. This means two things. First, if any extra hardware is required, the amount should be small. Second, the addition of fault tolerance should not lead to a significant reduction in performance, either before or after a failure has occurred.

Unfortunately, many of the proposed designs for FTMINs fail at least one of these two criteria. Either a lot of extra hardware is required, or there are large losses in performance, or both. In some cases this is unavoidable, because the design is expected to withstand a large number of faults without suffering a systemic failure. In other cases, the designs were analyzed using assumptions that have been made obsolete by changing technology.


We assume that all components of the MIN, including switches and links, are subject to failure. (Unlike some network studies, we assume the outermost stages and links can fail as well.) If there is a failure of a link, between two stages or between the network and a processor, then no messages can travel along that link. Within a switching chip, a failure affecting a single port, such as a break in a bonding wire or a failure of an I/O driver, is equivalent to a failure in the corresponding link. A failure of the internal crossbar makes the entire switch unusable, and no useful messages may flow through any pins on that chip.

1.4. Previous work

There are many FTMIN designs from which to choose, and these designs have many tradeoffs involving cost, performance and fault tolerance. No single design is best for all applications, so the needs of the system are a major factor determining which one is used.

Most FTMINs can be viewed as a basic MIN design (such as a delta) augmented with extra switches and/or links to create redundant pathways that can be used to bypass failed components. For instance, both the Extra Stage Cube (ESC) network[13] and the Augmented Delta Network[14] add an extra stage of switches to a standard delta network, so that there are b pathways between two processors. The INDRA network[15] replicates the network k times, so it can tolerate k - 1 faults. The Augmented Shuffle-Exchange Network[16], F-network[17], Enhanced Inverse Augmented Data Manipulator[18], and Multibutterfly[19] add extra ports to each switch. The first network uses these extra ports to create links between switches within the same stage, while the others create auxiliary links between stages.

Some networks use a combination of approaches. The Extra Stage Gamma network[20] combines extra inter-stage links and an extra stage. The Dynamic Redundancy (DR) network[21] both increases inter-stage links and adds extra switches, but the switches are added to each stage rather than to an extra stage. In the DR, an arbitrary number of extra switches can be added in each stage, provided that the same number is added to each stage, and that the same number of spare processors are added and connected in the same way. It is possible to form an N-processor machine with a basic delta network from any set of N/2 contiguous rows (the top and bottom rows are contiguous). One big advantage of the DR is that it includes a mechanism for attaching spare processors.

1.5. Synopsis

The remainder of this paper is divided into five Sections. The basic W-Network is presented in the next Section. Section 3 analyzes the redundancy costs of the network and discusses its performance. The design of experimental prototypes is covered in Section 4. Section 5 describes some possible extensions to the basic design, including how the W-Network can be used in a complete system to achieve processor fault tolerance as well. The final Section summarizes and concludes this paper.

2. THE W-NETWORK

The following presents the design of the W-Network. This FTMIN is a superset of a standard delta network. At any given time, only a subset of the links and switches is active, and this subset forms an ordinary delta network. The W-Network has low redundancy, mainly because each spare component can substitute for as many as √N other components. It uses a simple routing algorithm, and suffers no performance degradation following reconfiguration to bypass a fault. The W-Network can be built using a small number of chip types. The general design is illustrated by building a 16 x 16 W-Network out of 2 x 2 switches.


Figure 2. W-Network Top-Level Diagram: (a) extra switches (ISS and OSS); (b) port connections

2.1. Top-level design

While the baseline network (see Section 1.2.) is normally constructed tail-recursively, one can construct a topologically equivalent delta network using binary recursion, by connecting two stages of √N x √N subnetworks (e.g. a baseline network in which b = √N and n = 2). These stages are called the Input Subnetwork Stage (ISS) and the Output Subnetwork Stage (OSS). This network is modified to form the W-Network. First, an extra √N x √N subnetwork is added to each subnetwork stage. Next, each of the (√N + 1) subnetworks in the ISS is augmented by one extra output port, and each subnetwork in the OSS is augmented by one extra input. Finally, output port k of subnetwork j in the ISS is connected to input port j of subnetwork k in the OSS (0 ≤ j, k ≤ √N). This is shown in Figure 2(a).
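To make the top-level wiring concrete, the following sketch (Python, illustrative only; the port-naming scheme is not from the paper) enumerates the ISS-to-OSS links of a W-Network, assuming the square-root partitioning described above.

```python
import math

def middle_links(N):
    """List the ISS-to-OSS links of an N x N W-Network.

    Each stage has sqrt(N)+1 subnetworks, and output port k of ISS
    subnetwork j is wired to input port j of OSS subnetwork k.
    (Sketch for illustration only.)
    """
    r = math.isqrt(N)
    assert r * r == N, "N must be a perfect square"
    return [((j, k), (k, j))      # (ISS subnet, out port) -> (OSS subnet, in port)
            for j in range(r + 1)
            for k in range(r + 1)]

links = middle_links(16)          # 16 x 16 example: 5 x 5 = 25 middle links
assert ((4, 1), (1, 4)) in links  # ISS subnetwork 4, port 1 -> OSS subnetwork 1, port 4
```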

Now each subnetwork stage contains a redundant √N x √N subnetwork. If one removes one subnetwork from each stage, and removes all links connected to the removed subnetworks, the result, topologically speaking, will be the original N x N delta network. This property is the basic idea behind the W-Network. If a switch or link fails, the subnetwork containing that element is declared faulty. (If a link between two subnetworks fails, either subnetwork can be declared faulty.) If subnetwork j in the ISS is declared faulty, each of the subnetworks in the OSS can be configured to ignore signals from input port j, taking input only from the other √N ports. Similarly, if subnetwork k in the OSS is declared faulty, each of the subnetworks in the ISS is configured to ignore output port k. One subnetwork in each half must be disconnected, so if all subnetworks are working properly, one subnetwork in each half is chosen at random to be 'faulty.'

This leads to two questions:

1. How are N processors connected to √N(√N + 1) network ports (input or output)?
2. How is each subnetwork configured to 'ignore' one port?

2.2. Dual Ports

Like the rest of the hardware, the connections between a processor and the MIN are prone to failure. Some studies of fault-tolerant MINs simply assumed that these were fault-free, or even that the first and last stages of the network were fault-free. However, the possibility of faults in these links must be addressed in order to make the system fully fault-tolerant.

One way to handle faults in the processor-network connections is to provide two copies of each. Each processor has two input and two output connections to the network; at any given time, one of each is active. This was proposed[22] as a way to improve the fault tolerance of the ESC network, which previously had assumed that some critical switches were fault-free.

The W-Network uses this scheme in a way which also settles the first question above. If the ports of the W-Network are numbered from 0 to √N(√N + 1) - 1 (since there are (√N + 1) subnetworks), then processor i is connected to ports i and i + √N (see Figure 2(b)).

This means that most of the inputs and outputs of the W-Network are connected to two processors each. Only one connection in each pair may be active at a time. In networks such as the ESC, dual connections are handled with explicit multiplexers and demultiplexers. However, it is possible to eliminate this extra hardware. Each input port in the ISS, unless it is in subnetwork 0 or √N, connects to two processor outputs. These outputs can be tied together, provided that one of them is left in the undriven state. Similarly, each output of the OSS (unless in subnetwork 0 or √N) is connected to two processors. Since these connections are input ports, one of the processors simply ignores the input on that port.

Referring to Figure 2, suppose subnetwork j in the ISS is declared faulty. Then all processors between 0 and √N·j - 1 (inclusive) drive their top output links, while all other processors drive their bottom links. All other processor outputs are undriven. Thus, there is one connection between each processor and each port in the remaining √N subnetworks, and none of these connections conflict. Similarly, if subnetwork k in the OSS is declared faulty, then all processors between 0 and √N·k - 1 (inclusive) only read their top input links, while all other processors read their bottom links. Again, there is a one-to-one correspondence between the outputs of non-faulty subnetworks and active processor inputs.
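The port-selection rule above can be written down directly. The sketch below (Python, illustrative only, with hypothetical names) picks the network input port driven by each processor given the faulty ISS subnetwork, and checks that the active connections avoid the faulty subnetwork and never conflict.

```python
import math

def active_output_ports(N, faulty_iss):
    """Return the network input port driven by each processor.

    Processor i is wired to ports i ('top') and i + sqrt(N) ('bottom');
    processors 0 .. sqrt(N)*j - 1 drive their top links when ISS
    subnetwork j is faulty, the rest drive their bottom links.
    (Sketch; processor inputs on the OSS side are handled symmetrically.)
    """
    r = math.isqrt(N)
    return [i if i < r * faulty_iss else i + r for i in range(N)]

N, j = 16, 2
ports = active_output_ports(N, j)
r = math.isqrt(N)
# No driven port may land in the faulty subnetwork, and no two processors
# may drive the same port.
assert all(p // r != j for p in ports)
assert len(set(ports)) == N
```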

2.3. The W-Switch

The second of the two questions concerns the interconnections between the two halves of the W-Network. Each subnetwork contains √N ports on one side and √N + 1 ports on the other. However, internally, the subnetwork contains a simple √N x √N network. This can be a simple switch chip, or, if √N is too large, it can be built using any delta network topology, such as a baseline. But there must be a way to connect the √N ports of the delta network to the √N + 1 ports so that only √N of those ports are active.


Figure 3. The W-Switch: (a) general switch model (external ports 0..m, internal ports 0..m-1); (b) universal W-Switch chip design; (c) a complete subnetwork

The connection between the internal network and the outside is made with a W-Switch. Figure 3(a) shows a W-Switch for an m-port subnetwork or switch. Each internal port i can be connected to one of two external ports, either i or i + 1.* The W-Switch connects all of the internal ports to all of the external ports but one, maintaining the order of the internal ports. This switch configuration is a static setting which is only changed when the network must be reconfigured due to a fault. In Figure 3(a), external port 1 is isolated.
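A W-Switch setting is thus just a static, order-preserving map from the m internal ports onto m of the m + 1 external ports. A minimal sketch (Python, illustrative only; the function name is not from the paper):

```python
def w_switch_map(m, isolated_external):
    """Map internal ports 0..m-1 onto external ports 0..m, skipping one.

    Internal port i connects to external port i or i+1; isolating external
    port `isolated_external` leaves the ports below it in the 'straight'
    position and shifts the ports at or above it by one.
    (Sketch of the static setting only; no data path is modelled.)
    """
    assert 0 <= isolated_external <= m
    return {i: (i if i < isolated_external else i + 1) for i in range(m)}

# Figure 3(a): external port 1 isolated on a W-Switch with 3 internal ports.
print(w_switch_map(3, 1))   # {0: 0, 1: 2, 2: 3}
```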

There are several ways in which a W-Switch could be incorporated into a switch chip. Essentially, b + 1 inputs at the chip boundary must be merged into b inputs of a crossbar switch, or b outputs of the crossbar must be sent to b + 1 chip outputs, which must have tri-state drivers so that one can be left undriven. The merging and splitting could be done, for instance, using pass gates. Since the settings of these multiplexers would be static during normal operation, there would be no setup time for the pass gates. The signals would only be delayed by the pass gate itself. In today’s technology, this would add about 1 or 2 ns per chip.

Since some switches need W-Switches at their inputs and others need W-Switches at their outputs, two separate chip designs could be made. However, the two chip types can be combined into a single design in which both sides of the internal crossbar are connected to W-Switches, and the extra port can be either an input or an output. The design for b = 4 is illustrated in Figure 3(b), in which the extra port is shown being used as an output.

If the subnetwork is bigger than one switch (√N > b), it would seem that separate hardware would be needed for the W-Switches. Fortunately, W-Switches in adjacent switch chips can be combined into a larger W-Switch by connecting external port b of one switch with external port 0 of the next switch. (Naturally, this means that only one of these two ports can be driven at a time.) Figure 3(c) shows how to make a 16 x 17 subnetwork out of 4 x 4 switches by combining four W-Switches. The dashed lines indicate the chip boundaries in the second stage of the subnetwork (for simplicity, only the W-Switches at the switch outputs are shown).

* The name 'W-Switch' comes from the shape formed by the superposition of all possible switch positions when m = 2.


Figure 4. A Complete Network (legend: active and inactive switches, active and inactive links)

Figure 4 illustrates the W-Network by showing a 16 x 16 network made out of 2 x 2 switches. The W-Switches are configured for bypassing subnetwork 4 of the ISS and subnetwork 1 of the OSS. Bold lines show the active 16 x 16 delta network, which is topologically equivalent to Figure l(b).

2.4. Message routing and reconfiguration

W-Switches are static; their settings remain unchanged until it is necessary to reconfigure the W-Network following detection of a fault. Thus, the dynamic routing of messages from source to destination is controlled entirely by the settings of the internal b x b crossbars. Since the active part of the network is identical to an N x N delta (assuming the W-Switches are properly set and all processors are driving/reading the correct outputs/inputs), messages are routed in the same manner as in the delta network (see Section 1.2.). For instance, if the network in Figure 4 is packet-switched, a message is sent to output j by telling the b x b switch in stage i to send the message to port number k, where k is bit (3 - i) of j. In contrast, many fault-tolerant MINs have complex rerouting algorithms, and some require that the processors modify the control headers used to indicate message destinations.

The W-Switch settings only need to be changed in response to a network fault. How the system determines that a fault has occurred, and how it diagnoses the fault, are beyond the scope of this paper. To summarize, redundant information (e.g. parity bits) can be added to the messages to check for data faults. Messages can be counted at the sending and receiving ends, and the counts periodically compared, to check for routing faults[1]. When a fault is detected, it is necessary to perform diagnostic tests to determine the location of the fault and whether it is transient or permanent.

Once the system has determined the location of the faulty link or switch (and assuming that it has determined the fault to be permanent), it is only necessary to configure the W-Switch in each switch chip, and to tell each processor which network ports to use. This is done using the techniques described in the previous subsections. Reconfiguration should only occur rarely, so it does not need to happen quickly. Since the network would not be sending ordinary data during the diagnosis and reconfiguration phase, the control processor could send control packets (with special headers) for setting the W-Switches through the same network. No additional pins would be required on the switch chips for this function.
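Putting the pieces together, reconfiguration reduces to broadcasting a small amount of static state. The following is a hedged sketch (Python; the function and variable names are illustrative, not from the paper) of what that state looks like, assuming the fault has already been attributed to one ISS and one OSS subnetwork.

```python
def reconfigure(faulty_iss, faulty_oss, r):
    """Compute static settings for a W-Network with sqrt(N) = r.

    Returns, for every healthy ISS/OSS subnetwork, which of its r+1 outer
    ports its W-Switch must isolate, plus the link ('top' or 'bottom')
    each processor should drive and read.  If no subnetwork in a stage has
    actually failed, an arbitrary one is named 'faulty' (the paper picks
    one at random).  Fault detection and diagnosis are outside this sketch.
    """
    iss_settings = {s: faulty_oss for s in range(r + 1) if s != faulty_iss}
    oss_settings = {s: faulty_iss for s in range(r + 1) if s != faulty_oss}
    drive = ['top' if i < r * faulty_iss else 'bottom' for i in range(r * r)]
    read  = ['top' if i < r * faulty_oss else 'bottom' for i in range(r * r)]
    return iss_settings, oss_settings, drive, read

# 16 x 16 example: ISS subnetwork 2 and OSS subnetwork 0 are bypassed.
iss_cfg, oss_cfg, drive, read = reconfigure(2, 0, 4)
```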

3. ANALYSIS

The W-Network is capable of tolerating any single fault in any switch or link. If a switch or link fails, the subnetwork to which it belongs is identified, and the network is reconfigured to form a regular delta network which bypasses that subnetwork. Since the resulting delta network is equivalent, in every way, to the network which existed before the fault occurred, the W-Network achieves its primary goal, which is to tolerate a single fault with little or no drop in performance. (The W-Network can also tolerate two faults if one is located in the ISS and the other in the OSS. Thus, assuming component failures are independent, the W-Network can tolerate 50% of all double faults.)

The W-Network was developed, after considering other network designs, in a conscious effort to minimize the overall cost of fault tolerance in light of current VLSI technology. There are two types of costs associated with fault tolerance: the need to add redundant hardware to the network, and the loss in performance caused by this redundancy, under both fault-free and faulty network conditions.

3.1. Redundancy costs

The two main sources of redundancy costs in a network are extra chips and extra switch ports. While extra chips can be easily quantified, the costs of extra ports are often ignored. Extra ports are used by many FTMINs (including the W-Network) to create redundant pathways. Many of these networks were designed when integration levels were lower, but costs should be measured by the standards of current VLSI technology. Currently, a lot of switching logic and buffering can be packed into a chip. Since wiring inside a chip is cheaper than external wiring, it is desirable to have a topology which minimizes the number of chip boundaries. However, chips are more limited in the number of I/O signals they can have.

If data paths are made wide, to support high bandwidth and low latency, then today’s chip technology restricts b to a fairly small value, such as 4. Therefore, neither the internal b x b crossbar inside a switch chip nor the input/output buffers will be the main factor limiting chip area. Instead, the main factor will be the total number of pins on the chip.

If a chip is pin-bound, then its area is proportional to the square of the number of pins. This is because the connections can only be made at the perimeter of a chip. The same is true for most types of packaging. Thus, if the number of pins is increased by a factor of k, the area will increase by k2. This will increase cost by an even greater factor, because chip yields decrease with increasing chip sizes.
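As a rough feel for this scaling, here is an illustrative back-of-the-envelope calculation (not from the paper) that counts only the 2b data ports of a switch and ignores power, ground and control pins.

```python
def relative_area(base_ports, extra_ports):
    """Approximate area growth of a pin-bound chip when ports are added.

    Pins scale with the number of ports; a perimeter-limited die must grow
    linearly in edge length, hence quadratically in area.
    (Illustrative model only; real pad rings and packages differ.)
    """
    return (1 + extra_ports / base_ports) ** 2

# A b = 4 switch has 8 data ports.  One extra port per chip (as in the
# W-Switch) versus two extra ports per switch (typical of other FTMINs,
# according to the paper):
print(relative_area(8, 1))   # ~1.27x area
print(relative_area(8, 2))   # ~1.56x area
```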


Table 1. Comparison of fault-tolerant networks

                                   N = 64, b = 2                      N = 256, b = 4
Network                      Extra chips  Extra links  Total    Extra chips  Extra links  Total    Notes
Extra-stage (e.g., ESC)          1/6          0        0.166        1/4          0        0.25     1,2
Extra Stage Gamma                1/4         1/2       0.75         N/A         N/A       N/A      1,2
Dynamic Redundancy              65/64         1        2.02         N/A         N/A       N/A      3
INDRA (k = 2)                     1           0        1             1           0        1
F network; Multibutterfly         0           1        1            N/A         N/A       N/A      2
Enhanced IADM                     0         33/28      1.18         N/A         N/A       N/A      2,4
Augmented SE network              0          1/2       0.50          0          1/4       0.25     1,2
W-Network                        1/8         1/12      0.22         1/16        1/16      0.13

Notes: 1. Faults may increase some path lengths. 2. Traffic may be imbalanced after a fault. 3. Spare processors can be switched in. 4. Requires three switch types.

A cheaper way to increase the number of links on a switch is to decrease the width of the data paths by the same factor, thereby keeping the total pin count about the same. To keep bandwidth the same, extra chips would have to be added to widen the data paths to their original widths. This would require some mechanism for splitting data among the chips, which might require some additional capacity (e.g. transmitting the routing information to each section). Therefore, we can conclude that the increase in chip count will be at least linear (proportional to the increase in the number of links per switch) and could be worse.

Therefore, we have tried to keep the number of extra ports to a minimum. There is only one extra port per switch. Most other FTMINs add at least two ports per switch.

In summary, the extra hardware costs of the W-Network are:

1. Two extra subnetworks added to the 2√N already in the non-fault-tolerant network (cost = 1/√N times the number of chips in the MIN).

2. An extra port added to each chip with a W-Switch. If all other chips use the original, non-fault-tolerant implementation, then the extra port need only be added to each chip in two out of the log_b N stages.

To compare the redundancy costs of the various fault-tolerant networks, we calculated each network's redundancy costs (extra chips and links) for two cases: a 64 x 64 network made of 2 x 2 switches, and a 256 x 256 network made of 4 x 4 switches (see Table 1). For each network, the Table shows the fractional increase in chips, links, and the combination of the two (the total shown in decimal form for easier comparison). Though most previous designs have only been described for 2 x 2 switches, we were able to derive redundancy costs for the second network by generalizing some (though not all) of the designs in a consistent way.
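As a sanity check on item 1 above, the extra-chip fraction can be reproduced by direct counting. The sketch below (Python, illustrative only) counts switch chips alone, under the assumption that each subnetwork is itself a delta network of b x b switches, and reproduces the 1/√N figure for the two cases in Table 1.

```python
import math

def extra_chip_fraction(N, b):
    """Fraction of extra switch chips in a W-Network versus a plain delta.

    A plain N x N delta built from b x b switches has log_b(N) stages of
    N/b chips; the W-Network adds one sqrt(N) x sqrt(N) subnetwork to each
    of the two subnetwork stages.  (Counts switch chips only.)
    """
    n = round(math.log(N, b))
    base_chips = n * (N // b)
    r = math.isqrt(N)
    subnet_chips = (n // 2) * (r // b)   # chips in one sqrt(N) x sqrt(N) subnetwork
    return 2 * subnet_chips / base_chips

print(extra_chip_fraction(64, 2))    # 0.125  = 1/8  = 1/sqrt(64)
print(extra_chip_fraction(256, 4))   # 0.0625 = 1/16 = 1/sqrt(256)
```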

One additional cost is the cost of having two input and output ports connected to each processor. Although the W-Network has explicitly incorporated this in its design, other networks would need dual ports as well in order to tolerate failures at the network ports (except for the Dynamic Redundancy network, which allows spare processors to be switched in). Most of the fault-tolerant network designs simply avoid the problem by assuming that the ports, or even the outer stages of the network, are fault-free.


3.2. Performance considerations

When measuring the costs of fault tolerance, another important consideration is the impact of fault tolerance on the overall performance of the network. Some fault tolerance techniques can reduce the performance of the network, even when there are no faults. For instance, fault-tolerant networks which use an extra stage of switches to create multiple pathways will have higher latency due to the increase in path lengths.* Networks which increase the switching factor of the crossbar switches may cause the crossbars to slow down due to an increase in the lengths of the signal lines running through the crossbars.

* Some of these networks include multiplexers to bypass one stage when there are no faults, but these multiplexers also have delay.

As mentioned in Section 2.3., the W-Switch may add some extra latency due to the pass- gate multiplexing. However, the latency is small - only 1 or 2 ns per switch. For a 4-stage network, that would be no more than 8 ns, which would be less than the time required to go through an extra chip.

Another factor to consider is the effect of reconfiguration on overall performance. Performance can be affected both by increases in path lengths and uneven load distribution. Path lengths will increase uniformly in extra-stage networks when the extra stage is switched in. The ASEN uses links between elements in the same stage to bypass faults; these links will increase the lengths of paths affected by the fault.

Load distribution following a fault is a factor which is too often ignored. Suppose that a fault-tolerant network shifts some of the traffic from a faulty link to a healthy link, while the traffic already on the healthy link stays there. If the increased load on the link exceeds the bandwidth, then in the worst case this could slow down the entire computation if the communication occurs in a critical path in the computation. The bandwidth of each link can be increased to handle this condition (e.g. by widening the data paths or replicating the network), but this would increase the hardware, adding to the total redundancy cost. On the other hand, if the increased traffic never exceeds the bandwidth, then the network had excess capacity when there were no faults. This represents an 'opportunity cost' because this excess capacity could be removed from the non-fault-tolerant network, hence cutting its cost. Thus, either way, the uneven distribution of traffic following a fault imposes a redundancy cost.

The increase in traffic on the fault-free links would be negligible if traffic could be evenly redistributed throughout the network following a fault. However, this is generally not possible, because the structure of the delta network constrains the selection of alternate paths to a few logically related choices. For instance, if a switch in an ESC network fails anywhere other than the first or last stage (which would allow simple bypassing using multiplexers), the traffic going through that switch could only be rerouted to b - 1 related switches. Potentially, the load on these switches would increase by 1/(b - 1).

The W-Network, on the other hand, always acts like a regular delta network, even after a fault. After the network is reconfigured, the new bandwidth along each active pathway is the same as before. There is no redistribution of traffic (only a reassignment of links), hence the performance will not change following a fault.

4. IMPLEMENTATION

W-Network switch chips were designed by three separate teams of students for a course project in the McGill University Electrical Engineering course Introduction to VLSI. To simplify the design effort to a level appropriate for this course, the chips were made fully synchronous and only 4 bits wide. All chips had internal 4 x 4 crossbars, were pipelined, and supported wormhole routing[8]. Beyond these functional requirements, students were given wide latitude in their implementation and performance decisions, leading to considerable differences in the final designs. All chips were designed and simulated for 0.8 μm CMOS.

Speeds of the different designs ranged from 50 MHz to 100 MHz. The W-Switches themselves had average delays of 0.5 ns for inputs and 0.35 ns for outputs. (These used standard cells; full-custom switches would be faster.) How these delays affected overall speed was dependent on where the pipeline stage boundaries were located, since that would determine the critical paths. In the fastest design, the W-Switches were not in critical paths, and therefore the addition of fault tolerance had no impact on overall speed.

In terms of area, the internal crossbar is the component most sensitive to the data path width and number of inputs and outputs. Because a narrow data path was specified for this project, most of the die area was taken by the I/O pads and drivers. Thus, the impact on area of fault tolerance was almost entirely due to the increase in number of ports. The reconfiguration logic accounted for an insignificant increase.

5. EXTENSIONS

The W-Network can tolerate a single fault anywhere in the network. The W-Network can also be extended so that processor faults can be tolerated as well, through the use of spare processors. The ordinary switches in stages 0 and n - 1 can be replaced with W-Switch chips. The extra ports added can be connected to spares. This would increase the redundancy costs of the W-Network somewhat, because the switch chips with built-in W-Switches would have to be used in the outermost as well as the innermost stages. However, the redundancy would still be lower than for most other fault-tolerant networks, especially considering that most of them would also need additional hardware to allow spare processors to be switched in.

Throughout this paper, it has implicitly been assumed that the number of stages is even. The binary partitioning shown in Figure 2(a) yields the most cost-effective form of the W-Network. However, the W-Network principle applies to asymmetric partitions as well. The general method is to divide the delta network into an ISS containing b^(n-m) parallel b^m x b^m subnetworks, and an OSS containing b^m parallel b^(n-m) x b^(n-m) subnetworks. Each subnetwork stage is augmented by one extra subnetwork of the appropriate size, and the modified ISS and OSS are fully interconnected. Each processor output connects to two adjacent ISS subnetworks, and each processor input connects to two adjacent OSS subnetworks. This technique will work for any m between 1 and n - 1 (inclusive), but redundancy costs will be lower the closer m is to n/2.
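The claim about m can be checked with a quick chip count. The sketch below (Python, illustrative only) counts switch chips alone, assuming each subnetwork is itself a delta network of b x b switches, and shows that the overhead is smallest at the balanced split.

```python
def chip_overhead(n, b, m):
    """Extra-chip fraction of an asymmetric W-Network partition.

    The ISS holds b**(n-m) subnetworks of size b**m x b**m (m stages of
    b**(m-1) switches each) and the OSS holds b**m subnetworks of size
    b**(n-m) x b**(n-m); one spare subnetwork is added per stage.
    (Illustrative count; links and W-Switch ports are not included.)
    """
    base = n * b ** (n - 1)                       # chips in the plain delta
    extra = m * b ** (m - 1) + (n - m) * b ** (n - m - 1)
    return extra / base

# 64 x 64 network of 2 x 2 switches (n = 6): the balanced split m = 3
# gives the smallest overhead.
for m in range(1, 6):
    print(m, round(chip_overhead(6, 2, m), 3))
# -> 1: 0.422, 2: 0.188, 3: 0.125, 4: 0.188, 5: 0.422 (approx.)
```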

The W-Network can be made to tolerate k arbitrary faults by adding k subnetworks to each subnetwork stage of the MIN instead of 1, and by making the W-Switch in each chip a b x (b + k) switch. A total of k ports must overlap between adjacent W-Switch chips in the same subnetwork. Also, the number of connections between each processor and the network must increase to k + 1. Clearly, the redundancy cost increases rapidly with k, but may be worthwhile if more fault tolerance is required.

6. CONCLUSIONS

A low-cost fault-tolerant communications network is an important element in a large-scale multiprocessor such as EARTH[2]. This paper looked at the current trends in parallel processing in order to derive a set of performance and fault tolerance goals for the network of such a machine. It was argued that the network should transfer messages quickly and frequently, and that it should be able to continue normal operation after the occurrence of a single fault. Delta networks were chosen as the basis for the fault-tolerant design.

The W-Network provides single-fault tolerance at a relatively low cost. In the W-Network, it is possible to create a regular delta network by choosing a subset of the switches and links, using special switches called W-Switches to isolate failed components. Once the W-Network has been configured, it can be operated like an ordinary delta network without any modification to the routing algorithm.

This property is maintained even after reconfiguration, so the load distribution is not affected by a single fault. Therefore, a reconfiguration will not lead to load imbalances which could cause excess contention for some links or chips. The W-Switches only add minimal latency to the network, an amount less than the delay through an extra chip.

It can be seen that the W-Network is conceptually similar to the Dynamic Redundancy network, in that both operate by choosing a subset of the links to form a regular delta network. However, the number of links and extra stages needed to form a fault-free delta network is far smaller in the W-Network. The DR network provides the uncommon, but useful, feature of being able to switch in spare processors, but the W-Network can be enhanced with this feature at little additional cost.

The W-Network is an appropriate choice for today’s VLSI technology. The number of extra ports per switch has been kept to a minimum, in order to allow wider data paths, which leads to lower latencies and higher bandwidths. It also requires only one extra switch chip type. Finally, the W-Network can be extended so that spare processors can substitute for faulty processors without the need for additional external hardware.

7. ACKNOWLEDGEMENTS

I would like to thank Guang R. Gao, Herbert Hum and Andres Marquez for reading draft versions of this paper and making valuable suggestions.

The following students at McGill implemented the W-Network chip (see Section 4): David Ardman, Ramy Azar, Annie Bedevian, Arman Hematy, Robert Mills, Ala Qumsieh, Kasia Radecka, Sherif Sherif, and Victor Tyan.

REFERENCES

1. Kevin Bryan Theobald, 'Adding Fault-Tolerance to a Static Data Flow Supercomputer,' Master's thesis, 1990; Rep. MIT/LCS/TR-499, MIT Lab. for Comp. Sci., April 1991.
2. Herbert H. J. Hum, Olivier Maquelin, Kevin B. Theobald, Xinmin Tian, Xinan Tang, Guang R. Gao, Phil Cupryk, Nasser Elmasri, Laurie J. Hendren, Alberto Jimenez, Shoba Krishnan, Andres Marquez, Shamir Merali, Shashank S. Nemawarkar, Prakash Panangaden, Xun Xue, and Yingchun Zhu, 'A design study of the EARTH multiprocessor,' in Proc. of the Intl. Conf. on Parallel Architectures and Compilation Techniques, Limassol, Cyprus, pp. 59-68, ACM Press, June 1995.
3. Dana S. Henry and Christopher F. Joerg, 'A tightly-coupled processor-network interface,' Proc. of the Fifth Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Boston, Mass., October 1992, pp. 111-122.
4. William J. Dally, 'Directions in concurrent computing,' in Proc. of the IEEE Intl. Conf. on Computer Design, Port Chester, N.Y., October 1986, pp. 102-106.
5. Takeshi Horie, Kenichi Hayashi, Toshiyuki Shimizu, and Hiroaki Ishihata, 'Improving AP1000 parallel computer performance with message communication,' Proc. of the 20th Ann. Intl. Symp. on Comp. Architecture, San Diego, Calif., 1993, pp. 314-325.
6. Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Eric Schauser, 'Active messages: a mechanism for integrated communication and computation,' Proc. of the 19th Ann. Intl. Symp. on Comp. Architecture, Gold Coast, Australia, May 1992, pp. 256-266.
7. Daniel M. Dias and J. Robert Jump, 'Packet switching interconnection networks for modular systems,' Computer, 14, (12), 43-53 (1981).
8. William J. Dally and Charles L. Seitz, 'Deadlock-free message routing in multiprocessor interconnection networks,' IEEE Trans. Comput., 36, (5), 547-553 (1987).
9. Howard Jay Siegel, Wayne G. Nation, Clyde P. Kruskal, and Leonard M. Napolitano, Jr., 'Using the multistage cube network topology in parallel supercomputers,' Proc. of the IEEE, 77, (12), 1932-1953 (1989).
10. Janak H. Patel, 'Processor-memory interconnections for multiprocessors,' in Proc. of the 6th Ann. Symp. on Comp. Architecture, Philadelphia, Penn., April 1979, pp. 168-177.
11. Chuan-Lin Wu and Tse-Yun Feng, 'On a class of multistage interconnection networks,' IEEE Trans. Comput., 29, (8), 694-702 (1980).
12. Jon G. Kuhl and Sudhakar M. Reddy, 'Fault-tolerance considerations in large, multiple-processor systems,' Computer, 19, (3), 56-67 (1986).
13. George B. Adams III and Howard Jay Siegel, 'The extra stage cube: A fault-tolerant interconnection network for supersystems,' IEEE Trans. Comput., 31, (5), 443-454 (1982).
14. Daniel M. Dias and J. Robert Jump, 'Augmented and pruned N log N multistage networks: Topology and performance,' in Proc. of the 1982 Intl. Conf. on Parallel Processing, Bellaire, Michigan, August 1982, pp. 10-12.
15. C. S. Raghavendra and A. Varma, 'INDRA: A class of interconnection networks with redundant paths,' in Proc. of the Real-Time Systems Symp., Austin, Texas, IEEE Comp. Soc., Dec. 1984, pp. 153-164.
16. V. P. Kumar and S. M. Reddy, 'Augmented shuffle-exchange multistage interconnection networks,' Computer, June 1987, pp. 30-40.
17. L. Ciminiera and A. Serra, 'A connecting network with fault tolerance capabilities,' IEEE Trans. Comput., 35, (6), 578-580 (1986).
18. Robert J. McMillen and Howard Jay Siegel, 'Performance and fault tolerance improvements in the inverse augmented data manipulator network,' Proc. of the 9th Ann. Symp. on Comp. Architecture, Austin, Tex., April 1982, pp. 63-72.
19. Eli Upfal, 'An O(log N) deterministic packet routing scheme,' Proc. of the Twenty-First Ann. ACM Symp. on Theory of Computing, Seattle, Wash., May 1989, pp. 241-250.
20. Kyungsook Yoon Lee and Wael Hegazy, 'The extra stage gamma network,' Proc. of the 13th Ann. Intl. Symp. on Comp. Architecture, Tokyo, Japan, June 1986, pp. 175-182.
21. Menkae Jeng and Howard Jay Siegel, 'A fault-tolerant multistage interconnection network for multiprocessor systems using dynamic redundancy,' Proc. of the 6th Intl. Conf. on Distrib. Computing Systems, Cambridge, Mass., May 1986, pp. 70-77.
22. George B. Adams III and Howard Jay Siegel, 'Modifications to improve the fault tolerance of the extra stage cube interconnection network,' in Proc. of the 1984 Intl. Conf. on Parallel Processing, Bellaire, Michigan, Aug. 1984, pp. 169-173.