
Haupt-Seminar

Reliable Networks-On-Chip in the Many-Core Era

ONLINE DIAGNOSIS OF NETWORKS-ON-CHIP

University of Stuttgart

Institut für Technische Informatik

Summer Semester 2009

Author: Sebastian Klotz

Advisor: Dipl.-Inf. Stefan Holst

June 16th, 2009


Contents

List of Figures
List of Tables
Abbreviations
Abstract
1 Introduction to Networks-on-Chip
1.1 Motivation for Online Diagnosis
1.2 NoC Switch Architecture
2 System Level Fault Models
3 Control Fault Detection and Diagnosis
3.1 Methods used for Control Fault Detection
3.1.1 Distraction Detection
3.1.2 Switch Count / Time-To-Live
3.1.3 Time-Out Detection
3.1.4 Trapped Packet Detection
3.1.5 Sequence Number
3.2 Comparison and Discussion of Control Fault Detection Methods
3.2.1 Fault Coverage Capability
3.2.2 Fault Localization Capability
3.2.3 Implicated Overheads
4 Data Fault Detection and Diagnosis
4.1 Methods used for Data Fault Detection
4.1.1 End-to-End
4.1.2 Switch-to-Switch
4.1.3 Code-Disjoint Detection
4.2 Comparison and Discussion of Data Fault Detection Methods
4.2.1 Fault Coverage Capabilities
4.2.2 Fault Localization Capabilities
4.2.3 Implicated Overheads
5 Conclusions
References


List of Figures
Figure 1: Architecture of a generic NoC switch.
Figure 2: Illustration of the distraction detection method.
Figure 3: Switch incorporating distraction detection hardware.
Figure 4: Illustration of the end-to-end scheme.
Figure 5: Illustration of the switch-to-switch scheme.
Figure 6: NoC switch with code-disjoint detection.
Figure 7: 4x4 NoC with a switch and link fault.
Figure 8: Power consumption of e2e, s2sf, and cdd schemes [2].

List of Tables
Table 1: Mapping of erroneous switch components to system level faults.
Table 2: Mapping of system level faults to fault detection methods.
Table 3: Fault coverage of distraction detection for different NoC sizes [1].
Table 4: Fault coverage of different control fault detection methods for a 5x5 NoC [1].
Table 5: Area overhead introduced by embedding control fault detection methods [1].
Table 6: Set of network components suspected of producing faults.
Table 7: Area comparison of e2e, s2sf and cdd [3].

Abbreviations
ACK - Acknowledgement
cdd - Code-disjoint detection
CRC - Cyclic Redundancy Check
DED - Double Error Detecting
e2e - End-to-end
EDC - Error Detection and Correction
FIFO - First-In-First-Out
Flit - Flow control unit
lef - Link error flag
MUX - Multiplexer
NACK - Negative acknowledgement
NI - Network Interface
NoC - Network-on-Chip
PE - Processing Element
QoS - Quality of Service
s2s - Switch-to-switch
s2sf - Flit-level switch-to-switch
sef - Switch error flag
SaE - Stuck-at East
SaN - Stuck-at North
SaP - Stuck-at Processor
SaS - Stuck-at South
SaW - Stuck-at West
SEC - Single Error Correction
SoC - System-on-Chip
TMR - Triple Modular Redundancy
TTL - Time-to-live
VHDL - VHSIC hardware description language
VHSIC - Very High Speed Integrated Circuits
VLSI - Very-Large-Scale Integration


Abstract

With the advent of the many-core era, the design of reliable Network-on-Chip (NoC) structures gains in importance. Online diagnosis is a key issue in this context, because the operational reliability of NoCs has to be ensured during the entire life-span of the product. This paper therefore introduces different online fault detection and diagnosis methods for NoCs. A fault model is introduced that serves as an abstraction for altered switch behavior originating from faults in switch components. These fault models are further classified into control and data faults, depending on where the faults appear. Control faults occur in the control part of the switch and may result in altered routing behavior. To cope with these faults, several control fault detection methods are introduced: distraction detection, switch count, time-out detection, trapped packet detection, and sequence numbers. Data faults, in contrast, occur in the data path, which is responsible for the physical transport of data, and corrupt the transported information. These faults can be tackled with the data fault detection methods end-to-end (e2e), switch-to-switch (s2s), and code-disjoint detection (cdd), all of which rely on error detection codes. Finally, the fault detection methods are evaluated regarding their fault coverage and fault localization capabilities, with respect to the implicated area and power consumption overhead.

1 Introduction to Networks-on-Chip

Continuous development in the field of Very-Large-Scale Integration (VLSI) technology enables circuit designers to integrate millions of transistors on a single die by scaling structures down into the nanometer domain. This evolution makes System-on-Chip (SoC) design a viable approach for developing complex integrated systems under power consumption and manufacturing cost constraints. Global on-chip interconnects, however, are not well suited for multi-core SoC applications, since they become a serious bottleneck with respect to performance and power consumption. Investigations have shown that these issues can be addressed by introducing Networks-on-Chip (NoCs), which enable packet-based communication among the cores by means of switches [1, 2, 3, 4]. As a side-effect, shrinking feature sizes increase the failure rate of the overall system. Faults occurring in the system can be classified as permanent (e.g. stuck-at faults), intermittent (e.g. loose connections), or transient (e.g. radiation-induced) faults. Defects introduced during manufacturing are mainly of permanent nature and can be detected by offline testing and diagnosis at the manufacturer. In contrast to permanent faults, intermittent and transient faults are likely to occur during operation. According to preliminary studies presented in [5], 80% of system failures actually result from transient faults. To address the unreliability of nanometer technologies, fault tolerant mechanisms have to be implemented. These mechanisms are inevitable to provide a certain level of Quality of Service (QoS), such as reliability, throughput, latency, and power dissipation, during the entire life-span of the product [2]. Coping with unreliability requires error detection and correction methods that are capable of identifying altered network behavior and automatically resolving it by means of recovery mechanisms.


1.1 Motivation for Online Diagnosis

Each switch component incorporated in a NoC is likely to produce errors during operation due to transient, intermittent, or permanent faults. Depending on the fault, either the routing of a packet or the data it carries is affected. Using online diagnosis, fault detection can be performed without additional test stimuli, which would implicate time, power, and performance penalties. After a fault is detected and diagnosed, meaning that the fault is discovered and located, this information is passed on to the recovery mechanism. Online diagnosis is also responsible for differentiating between faults of transient, intermittent, and permanent nature, since the applied remedies differ. This classification can be made, for example, by means of "historical data" collected during network operation. Recovery from a transient fault is straightforward, since a second attempt or retransmission will succeed. Permanent faults, on the other hand, will always result in the same error. Hence it is necessary to keep track of the errors that occurred as well as their locations; online diagnosis thereby serves as vital input for the recovery mechanism. In case of permanent faults, recovery schemes usually utilize spare or idle components, or choose different routing paths to bypass failed components, in order to keep the overall system running. Applying recovery mechanisms assumes that the running application does not fully utilize the NoC structure, in other words that some redundancy is incorporated. If no redundancy is included, or it is already used up, further failing components cannot be tolerated and eventually the entire system fails. To sum up, the duties of online diagnosis are the detection, localization, and classification of faults during operation. This paper mainly focuses on the two aspects detection and localization; fault classification is not covered.

1.2 NoC Switch Architecture

NoC structures are usually composed of reusable generic NoC switches, which can be arranged, for example, in a 2-D mesh to simplify design, implementation, offline test, and online diagnosis. A generic switch consists of the following components: input and output ports, First-In-First-Out (FIFO) buffers, retransmission buffers (not always needed, depending on the retransmission scheme used), multiplexers, a crossbar switch, and the routing element, as depicted in figure 1 (for clarity, FIFO output and retransmission buffers are omitted) [4].

N: North, S: South, E: East, W: West, P: Processor, FIFO: First-In-First-Out, MUX: Multiplexer

Figure 1: Architecture of a generic NoC switch.


The NoC switch considered here possesses five I/O ports, connecting the switch to its adjacent switches (N, E, S, and W) and to the Processing Element (PE). Each input and output port comprises a FIFO buffer which can store a limited number of incoming or outgoing flow control units (flits) in a queue. Since the flow of information in the network is usually not homogeneously distributed, these buffers are necessary to prevent flits from being dropped. Depending on the applied communication scheme, additional retransmission buffers might be needed that hold a copy of the transmitted data until correct reception is confirmed. Each output comprises a multiplexer which can select the input (the head of the input FIFO queue) of any other port. Multiplexers and input FIFO buffers are connected via a crossbar switch. According to the routing information provided in the header flit, the router determines the correct output port by means of the multiplexer select signals [4].
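To make this structure concrete, the following minimal Python sketch models such a switch as a plain data structure; the class name, field names, and the (x, y) coordinate convention (taken from figure 2) are assumptions made for this illustration, not part of the referenced designs.

```python
from collections import deque

PORTS = ["N", "E", "S", "W", "P"]  # four neighbour ports plus the local processing element

class Switch:
    """Simplified behavioral model of the generic NoC switch of figure 1."""

    def __init__(self, x, y):
        self.address = (x, y)
        # One input FIFO per port; flits wait here until the router serves them.
        self.in_fifo = {p: deque() for p in PORTS}
        # Output multiplexer select signals, set by the routing element.
        self.mux_select = {p: None for p in PORTS}

    def route_xy(self, dest):
        """XY routing: route in X direction first, then in Y, then deliver to the local PE."""
        dx = dest[0] - self.address[0]
        dy = dest[1] - self.address[1]
        if dx > 0:
            return "E"
        if dx < 0:
            return "W"
        if dy > 0:
            return "N"
        if dy < 0:
            return "S"
        return "P"  # destination reached: forward to the processing element
```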

2 System Level Fault Models

One of the most important prerequisites for embedding fault tolerant schemes into NoCs is to know how failed switch components affect the exchange of information among switches. A fault model captures important properties, such as the behavior of a failure mode, while concealing complex details. The diversity of online faults originating from failures of specific switch components (structural view) can be abstracted using a system level fault model (functional view). The altered behavior of a switch can be classified into different fault models known as dropped data faults, direction faults, multiple copies in space faults, multiple copies in time faults, and corrupted data faults [1, 4]. An altered switch behavior covered by a fault model may result from more than one component. For clarity, table 1 provides an overview of all fault models and of the faulty switch component(s) they originate from. Subsequently, all fault models are introduced in detail.

Dropped data fault: router, FIFO buffer, multiplexer
Direction fault: router
Multiple copies in space fault: router, multiplexer
Multiple copies in time fault: FIFO buffer
Corrupted data fault: router, FIFO buffer, multiplexer, crossbar switch

Table 1: Mapping of erroneous switch components to system level faults.

Dropped Data Fault:

A dropped data fault describes a fault in the FIFO buffer, the routing element, or the multiplexers, where received data is not forwarded to the correct output port but dropped instead [4].

Direction Fault:

In case of a direction fault, the select signal of the multiplexer is not set properly due to a fault in the routing element. Hence received data may be forwarded to an output port it was not intended for [4]. A special subset of direction faults are the so-called stuck-at port faults, which always forward incoming data to one specific switch port due to a permanent fault [1]. These stuck-at port faults can be further classified into Stuck-at North (SaN), Stuck-at East (SaE), Stuck-at South (SaS), Stuck-at West (SaW), and Stuck-at Processor (SaP), according to the port they force data to.


Multiple Copies in Space Fault:

A multiple copies in space fault results from a fault in either the routing element or the multiplexer. As a result, received data is forwarded twice, firstly to the correct output port, and secondly to an additional port [4].

Multiple Copies in Time Fault:

In case of a multiple copies in time fault, received data is forwarded to the correct output port more than once. This results from a fault in a FIFO buffer [4].

Corrupted Data Fault:

The corrupted data fault results from a fault in either the router, a FIFO buffer, a multiplexer, or the crossbar switch. Received data is forwarded to the correct output port, but the contained data may be altered and therefore unusable [4]. Before methods to cope with each of these fault models are introduced, a further distinction is necessary. A switch can be viewed as being composed of a data path and a control part [2]. The data path incorporates the circuitry responsible for the physical transport of data between the input and output ports. The control part, on the other hand, determines where incoming data has to be forwarded to, by means of the included routing information. Hence, faults in the data path and in the control part cannot be addressed with the same error detection and correction mechanisms. As a result, the fault models are classified as either control or data faults:

Control Faults: Dropped data fault, direction fault, multiple copies in space fault, and multiple copies in time fault.

Data Faults: Corrupted data fault.

These two fault classes can be addressed using control fault detection and data fault detection methods, respectively. The control fault detection methods proposed in [1, 4] are distraction detection, switch count, trapped packet detection, sequence number, and time-out detection. The data fault detection methods addressing data faults are end-to-end (e2e), switch-to-switch (s2s), and code-disjoint detection (cdd) [2, 3].

Dropped data faults: sequence number, time-out detection
Direction faults: distraction detection, switch count, trapped packet detection
Multiple copies in space faults: distraction detection, sequence number
Multiple copies in time faults: sequence number
Corrupted data faults: e2e, s2s, cdd

Table 2: Mapping of system level faults to fault detection methods.

Table 2 provides an overview of all fault models and shows how each of them can be addressed by the control and data fault detection schemes just introduced. All of these control and data fault detection methods are discussed in detail in chapters 3 and 4, respectively.


3 Control Fault Detection and Diagnosis

3.1 Methods used for Control Fault Detection

This chapter introduces control fault detection methods, which are used to handle errors occurring in the control part of the switch. A comparison and discussion regarding their fault coverage and localization capabilities as well as the overheads they implicate is given in section 3.2.

3.1.1 Distraction Detection

The distraction detection method addresses direction faults occurring in the routing element during operation, caused by stuck-at port or transient faults. A prerequisite for distraction detection is a mesh-based NoC structure performing, for example, the XY routing algorithm. In this routing scheme, data is first forwarded in X- and then in Y-direction towards the destination switch. Wormhole switching is applied as the switching method [1]. Figure 2 depicts a 3x3 NoC performing the XY routing algorithm; for example, dispatching a packet from S2 to S4 results in the path PS2 → S4 = {S2, S1, S4}. With this control fault detection method, each switch verifies whether incoming data was actually intended to traverse this specific switch. Hence, distraction detection is not capable of detecting SaP faults, since such a direction fault cannot be discovered by another switch. The distraction detection rule proposed in [1] works as follows: since a packet is first routed in X- and then in Y-direction, either the Y-coordinate of the current switch equals that of the source, or its X-coordinate equals that of the destination. This comparison is performed by additional hardware incorporated in the routing element, using the switch location and the routing information (location of source and destination), as depicted in figure 3.

Figure 2: Illustration of the distraction detection method (3x3 NoC with switches S1[0,0] to S9[2,2]; stuck-at port faults SaN at S3 and S5 and SaE at S7).

Figure 3: Switch incorporating distraction detection hardware (a comparator checks the switch address against the packet's source and destination addresses).
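As an illustration of the rule, the following Python sketch checks whether a packet observed at a switch violates the XY routing policy; the function name and the (x, y) tuple representation are assumptions for this example.

```python
def distraction_detected(current, src, dst):
    """Distraction detection rule from [1]: on a legal XY route the current switch
    shares its Y-coordinate with the source (X phase) or its X-coordinate with the
    destination (Y phase)."""
    return not (current[1] == src[1] or current[0] == dst[0])

# Example 1 from the text: S4[0,1] sends to S9[2,2]; the SaN fault at S5[1,1]
# deflects the packet to S8[1,2], which violates the rule.
assert distraction_detected(current=(1, 2), src=(0, 1), dst=(2, 2))
# A switch on the legal path, e.g. S5[1,1], raises no error.
assert not distraction_detected(current=(1, 1), src=(0, 1), dst=(2, 2))
```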

In the following, three examples of stuck-at port faults occurring in this NoC are considered, together with their influence on the routing process and the detection capabilities of the proposed distraction detection method. In the first example, the PE adjacent to S4 dispatches a packet with destination S9; the expected routing is PS4 → S9 = {S4, S5, S6, S9}. Due to the SaN fault at S5, the packet is deflected to the north port instead of continuing switch traversal in X-direction. At this routing step, neither the Y-location of the current switch S8 equals that of the source S4, nor does the X-location of S8 equal that of the destination S9. This indicates a violation of the XY routing algorithm, in other words a deflection, and is detected by the additional hardware. Distraction detection is, however, not capable of detecting all stuck-at port (stuck-at north, east, south, and west) and transient faults, because it only considers the functional behavior of the switch. This is demonstrated in the following two examples. In the second example, S2 sends a packet to S6; the expected routing is PS2 → S6 = {S2, S3, S6}. Since the SaN fault in S3 corresponds with the intended output, in other words S3 acts as expected, there is no distraction and the SaN fault goes undetected. In the third example, S8 sends a packet to S4; PS8 → S4 = {S8, S7, S4}. Due to the SaE fault in S7, the packet is reflected back to the sending switch S8 and eventually starts circling between S7 and S8 (livelock). This fault cannot be detected either, since the XY routing policy is not violated. As there is no ACK (acknowledgement) signal from the destination, the source switch assumes that the packet was dropped and may start a retransmission. Several retransmission attempts will result in the same problem, and eventually those switches stop providing any service, because their FIFO buffers are fully loaded with identical packets. A possible remedy for circling packets is to extend distraction detection by the switch count method, which keeps track of the number of traversed switches.

3.1.2 Switch Count / Time-To-Live

Due to a special set of malicious direction faults, packets can start circling between adjacent switches. To resolve this kind of livelock, the switch count method introduces an additional, so-called switch count field [1]. Every time a packet passes a switch, this field is incremented by one. A circling packet will eventually cause an overflow of the switch count field, indicating that the packet should be dropped, as it only occupies resources without making any progress. This overflow also serves as valuable input for recovery mechanisms, because a circling packet may result in a time-out error that eventually triggers the retransmission of the packet. To avoid that this attempt results in the same problem, the incorporated recovery mechanism should select a different route for the retransmission.
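A minimal sketch of the switch count handling at a switch, with an assumed field width of four bits (the actual width is not specified in [1]):

```python
SWITCH_COUNT_BITS = 4                   # assumed field width; an overflow marks a circling packet
MAX_SWITCH_COUNT = (1 << SWITCH_COUNT_BITS) - 1

def on_switch_traversal(packet):
    """Increment the switch count field; drop the packet when the field overflows."""
    packet["switch_count"] += 1
    if packet["switch_count"] > MAX_SWITCH_COUNT:
        # Livelock suspected: the packet only occupies resources without making progress.
        return "drop_and_report"        # recovery should retransmit over a different route
    return "forward"

packet = {"switch_count": 0}
assert on_switch_traversal(packet) == "forward"
```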

3.1.3 Time-Out Detection

The time-out detection method addresses dropped data faults caused by faults in the router, FIFO buffer, and/or multiplexer during operation. When this control fault detection method is applied, each sending node incorporates a time-out mechanism which works as follows: a timer is started at the sending node whenever it dispatches a packet. The receiver confirms the reception of a packet by raising the ACK/NACK (acknowledgement/negative acknowledgement) signal, which stops the timer at the sending node. If a packet got lost during transmission, the timer at the sending node expires and invokes a retransmission [3].
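A sketch of the sender-side time-out mechanism; the threshold value and the use of a software timer are illustrative assumptions, whereas the real mechanism would be a hardware counter at the PE.

```python
import time

class TimeoutSender:
    """Sending node with a per-packet time-out that triggers retransmission."""

    def __init__(self, timeout_s=0.001):    # threshold must be tuned to NoC size and traffic
        self.timeout_s = timeout_s
        self.pending = {}                    # packet id -> time the packet was dispatched

    def send(self, packet_id):
        self.pending[packet_id] = time.monotonic()

    def on_ack(self, packet_id):
        self.pending.pop(packet_id, None)    # reception confirmed, stop the timer

    def expired(self):
        """Return the ids of packets whose timer expired and that must be retransmitted."""
        now = time.monotonic()
        return [p for p, t in self.pending.items() if now - t > self.timeout_s]
```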

3.1.4 Trapped Packet Detection

Trapped packet detection, as proposed in [1], addresses stuck-at processor faults, which cannot be discovered by the distraction detection method. For every packet that is forwarded to the processing element, a comparison between the desired destination and the actual switch location is performed. Hence, on the occurrence of a SaP fault, the additional hardware indicates that this packet was misdirected, in other words "trapped".

3.1.5 Sequence Number

To address dropped data as well as multiple copies in space/time faults, [4] proposes the so-called sequence number method. In this approach the processing element appends a unique sequential number, in form of an additional flit, to the original information. At the destination node a watchdog counter keeps track of the incoming packet sequence. Multiple copies in time/space faults result in at least two packets with identical sequence numbers, indicating that a fault occurred during delivery. Dropped packets, on the other hand, can be easily identified, because their sequence numbers never arrive at the receiver.
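The receiver-side bookkeeping can be sketched as follows, assuming for simplicity that packets from a given source normally arrive in order:

```python
class SequenceChecker:
    """Destination-side watchdog that detects duplicated and dropped packets."""

    def __init__(self):
        self.expected = 0

    def on_packet(self, seq):
        if seq < self.expected:
            return "duplicate"             # multiple copies in space/time fault
        if seq > self.expected:
            missing = list(range(self.expected, seq))
            self.expected = seq + 1
            return ("dropped", missing)    # these sequence numbers never arrived
        self.expected += 1
        return "ok"

chk = SequenceChecker()
assert chk.on_packet(0) == "ok"
assert chk.on_packet(2) == ("dropped", [1])   # packet 1 was lost
assert chk.on_packet(2) == "duplicate"        # the same packet delivered twice
```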


3.2 Comparison and Discussion of Control Fault Detection Methods

This section provides a comparison and discussion of the control fault detection methods presented above. The focus is on fault coverage and fault localization capabilities, with respect to the implicated area overhead.

3.2.1 Fault Coverage Capability

The results of an evaluation of the fault coverage of the control fault detection methods distraction detection, switch count, and trapped packet detection are published in [1]. As this publication does not cover dropped or duplicated data faults, the fault coverage of the sequence number and time-out detection methods is discussed separately afterwards.

Distraction Detection, Switch Count and Trapped Packet Detection:

Using a high-level VHDL-based platform, the authors of [1] determined the fault coverage capabilities of different combinations of control fault detection methods for the NoC sizes 3x3, 5x5, and 7x7. The evaluated combinations presented in [1] are pure distraction detection, distraction detection with switch count, distraction detection with trapped packet detection, and distraction detection with switch count and trapped packet detection.

Publication [1] provides fault coverage results for all combinations of control fault detection methods and NoC sizes. The following discussion focuses on two comparisons in order to grasp the key ideas of their interplay. The first comparison illustrates, as an example, the fault coverage of one control fault detection combination for different NoC sizes. The second comparison regards the fault coverage for a fixed NoC size and different control fault detection combinations.

Addressed switches: 25% / 50% / 75% / 100%
3x3 NoC: 30% / 45% / 67% / 67%
5x5 NoC: 36% / 57% / 64% / 74%
7x7 NoC: 32% / 63% / 71% / 76%

Table 3: Fault coverage of distraction detection for different NoC sizes [1].

Table 3 shows the fault coverage of distraction detection, depending on the NoC architecture and the number of addressed switches. The fault coverage, defined as the ratio of detected control faults to the total number of existing control faults, is determined by routing random packets through the network structure. The number of addressed switches is the ratio of switches chosen as destination at least once to the total number of switches in the network. Regarding the fault coverage for different NoC architecture sizes, it can be seen that the fault coverage rises with increasing network size. In larger networks packets have to traverse longer paths on average; since a packet passes more network components, it is more likely that control faults are detected. Regarding the fault coverage for different numbers of addressed switches, the coverage is obviously much higher when each node is addressed as destination, because distinct destinations result in different paths and more network components are covered in total. When mapping an application to specific cores of the NoC, sources and sinks should therefore be homogeneously distributed in order to achieve higher fault coverage.


Table 4 shows, as an example, the fault coverage for a 5x5 NoC architecture, depending on the control fault detection methods and the number of addressed switches. As expected, distraction detection has the highest fault coverage, because it addresses all deflection faults except the SaP fault. The switch count method, on the other hand, does not contribute much to the fault coverage, as circling packets are less likely to occur: firstly, the packet has to traverse a switch with a stuck-at port fault, and secondly, this port fault has to point in the counter-direction of the incoming packet. Although its coverage contribution is low, it is desirable to implement switch count, as it recovers from livelocks caused by reflection. Trapped packet detection, in turn, has a higher fault coverage, since switches affected by a SaP fault cannot dispatch data and SaP switches on the routing path "catch" packets.

Addressed switches: ≈6 (25%) / ≈13 (50%) / ≈19 (75%) / 25 (100%)
Distraction Detection: 36% / 57% / 64% / 74%
Distraction Detection with Switch Count: 44% / 65% / 69% / 76%
Distraction Detection with Trapped Packet Detection: 52% / 79% / 88% / 98%
Distraction Detection with Switch Count and Trapped Packet Detection: 60% / 91% / 95% / 100%

Table 4: Fault coverage of different control fault detection methods for a 5x5 NoC [1].

According to [1], only distraction detection with switch count and trapped packet detection is able to discover all stuck-at port faults for all NoC sizes, under the condition that each switch is addressed at least once.

Time-Out Detection:

Time-out detection handles packets which are dropped during transmission. The timer is started as soon as the packet leaves the sender and is stopped when the packet arrives at the destination, whose reception is confirmed by raising either the ACK or the NACK signal. The two reasons why time-out detection may not always work are therefore the timer itself and the ACK/NACK signals. The timer is usually implemented as a counter at the sending PE, driven by a clock, whose overflow indicates the expiration of the time limit. This means that it is up to the PE to ensure the correct behavior of the timer. The expiration of the timer indicates that the packet did not reach its destination within the assumed time frame. The time frame has to be adapted to the NoC size, the assumed traffic in the network, and other factors that influence the overall latency. Choosing a time frame that is too large results in longer reaction times until a retransmission is invoked. If, on the other hand, the time frame is chosen too small, or the average latency increases, e.g. due to a temporary traffic peak, a retransmission is initiated although the packet is still traversing the network. This results in a duplicated data fault, which can be addressed by sequence number detection. Exclusive ACK/NACK signals can be implemented in a fault tolerant way by applying TMR (Triple Modular Redundancy); an example covering this is shown in chapter 4. ACK/NACK signals dispatched in a piggyback fashion, utilizing packets, are protected by the same control and data fault detection methods as regular packets. Assuming that the timer at the sending node is always functional and the ACK/NACK communication works correctly, time-out detection has a fault coverage of 100%.


Sequence number:

The sequence number method addresses control faults which result in dropped data and multiple copies faults. Assuming that the sequence number field is generated, transmitted, and evaluated correctly, the fault coverage is 100%; in other words, all dropped and duplicated packets are detected. Sequence number generation and comparison are performed in the source and destination PE, respectively. Therefore it is up to the PEs to ensure correct generation and comparison. A corruption of the sequence number field introduced during packet transmission is covered by the data fault detection methods. Hence, the assumption made at the beginning applies and the fault coverage is 100%.

3.2.2 Fault Localization Capability

Besides control fault detection, another important issue is the localization of faulty components, which serves as valuable input for recovery mechanisms. This section therefore discusses the fault localization capabilities of the control fault detection methods introduced above.

Distraction Detection:

For faults discovered by distraction detection, such as deflections (SaP faults excluded), localization is straightforward: the violation of the XY routing rule is detected by the successive switch and can easily be reported to the recovery mechanism by sending a packet over the NoC [1].

Switch Count:

A reflection resulting in a livelock, on the other hand, cannot be discovered by distraction detection. The switch count method resolves this situation, but it is not able to determine the exact fault location [1].

Trapped Packet Detection:

Fault localization with trapped packet detection is unambiguous. Due to the SaP fault, however, the processing element is not capable of reporting the fault location, as its outgoing packets are "trapped"; in other words, every packet is reflected back to the PE [1], because the connection via the adjacent Network Interface (NI) is erroneous. A possible remedy would be to introduce an additional flag indicating SaP faults. This flag has to serve as direct input for the recovery mechanism, as the NoC cannot be utilized to forward this information. For large NoC sizes, this results in complex wiring.

Time-Out Detection and Sequence number:

For the sequence number as well as the time-out detection method, a localization of faults is not possible. They are only capable of observing whether packets got lost during transmission or arrive in duplicated form at the final destination.


3.2.3 Implicated Overheads

Besides fault coverage and fault localization capabilities gained from sophisticated control fault detection methods, another important measure is the overhead they implicate. In case of control fault detection the main concern is the area needed to implement the desired functionality. Depending on the control fault detection method, an additional area overhead is either added to the PE(s), the switches or both.

Distraction Detection, Switch Count and Trapped Packet Detection:

To estimate the hardware overhead introduced by embedded control fault detection methods, [1] applied these detection schemes to available NoC switch structures. Regarding the results of these investigations, shown in table 5, pure distraction detection introduces the smallest total area overhead at (very) high fault coverage. Applying the switch count method results in a high area overhead, because a counter and an overflow detection are necessary. As already mentioned, switch count detection does not contribute much to the fault coverage, but it is necessary to resolve livelocks. Trapped packet detection adds area overhead only to the receiving PEs. Applying all control fault detection methods proposed in [1], i.e. distraction detection with switch count and trapped packet detection, leads to a total area overhead of 1.6%.

Area overhead: switch / per core / total
Distraction Detection: 1.6% / 0% / 0.2%
Distraction Detection with Switch Count: 9.3% / 0% / 1.4%
Distraction Detection with Trapped Packet Detection: 1.6% / 0.3% / 0.5%
Distraction Detection with Switch Count and Trapped Packet Detection: 9.3% / 0.3% / 1.6%

Table 5: Area overhead introduced by embedding control fault detection methods [1].

In addition to the area overhead, the switch count method also requires a switch count field, in which the number of traversed switches is stored. This introduces a packet overhead of one flit (information redundancy), which has to be transported over the network in addition to the usual data. Distraction detection and trapped packet detection also require additional information, such as the source and/or destination address; since this information is already included for routing purposes, it is not considered as overhead.

Time-Out Detection:

Time-out detection is performed at the sending PE. Therefore the timer incorporated in the sending PE requires some area overhead. Information redundancy in form of an additional flit is not required.

Sequence Number:

Sequence number detection requires additional area at the source as well as the destination node to implement the generation and the verification of the sequence numbers, respectively. In addition, one more flit holding the sequence number has to be attached to each packet.


4 Data Fault Detection and Diagnosis

In addition to detecting altered control behavior, it has to be ensured that corrupted data is detected. This chapter therefore introduces online diagnosis methods for embedding data fault detection mechanisms into NoCs. All methods introduced in this chapter are mainly based on error detection and correction codes, so a short introduction to Error Detection and Correction (EDC) is given first. Error detection and correction mechanisms are traditionally used in communication systems, where information redundancy is added to the original information in order to recover from the mostly transient errors occurring in a transmission channel. These mechanisms can also be applied to circuit design, where faults such as random bit flips, stuck-at faults, or delay faults affect the correct circuit behavior. On the one hand, such codes are capable of correcting single and detecting multiple errors. On the other hand, they imply latency and hardware overhead and introduce information redundancy, which increases manufacturing costs and power consumption. Depending on the application, designers have to trade off the pros and cons of sophisticated EDC schemes. The same considerations apply when error detection and correction mechanisms are used in a NoC to enable data fault detection.

4.1 Methods used for Data Fault Detection

The following subsections introduce three data fault detection methods, proposed in [2, 3], known as end-to-end (e2e), switch-to-switch (s2s), and code-disjoint detection (cdd). For comparability and simplicity, all schemes utilize an error detection code (a parity-check code) for detection and retransmission as the error correction mechanism. Applying more sophisticated error detection codes, such as dual-rail or Cyclic Redundancy Check (CRC) codes, results in better detection capabilities. Correction by retransmission requires buffering the data, e.g. at the source node, until correct transmission is confirmed by the recipient. Depending on the applied data fault detection method, these retransmission buffers are located at different positions in the network. Besides error correction by means of retransmission, it is also possible to additionally apply more sophisticated error detection and correction methods, such as Hamming codes, to further increase fault coverage and tolerance. Further discussion of fault detection, tolerance, and correction capabilities is given in section 4.2.1.
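For reference, here is a minimal sketch of the even parity-check code assumed throughout this chapter, applied to a flit represented as a list of bits (the 8-bit flit width is an arbitrary choice for the example):

```python
def encode_parity(flit_bits):
    """Append one even-parity check bit to the flit."""
    return flit_bits + [sum(flit_bits) % 2]

def parity_error(coded_bits):
    """Return True if an odd number of bits was flipped during transmission."""
    return sum(coded_bits) % 2 != 0

flit = [1, 0, 1, 1, 0, 0, 1, 0]
coded = encode_parity(flit)
assert not parity_error(coded)

corrupted = coded.copy()
corrupted[3] ^= 1                 # a single bit flip is detected ...
assert parity_error(corrupted)

corrupted[5] ^= 1                 # ... but a second flip makes the error undetectable
assert not parity_error(corrupted)
```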

4.1.1 End-to-End

In the end-to-end (e2e) scheme, proposed in [3], the source node encodes its data at the packet level by adding error detection codes to the original information. At the destination node the data is decoded and verified, as depicted in figure 4. Errors occurring during transmission affect data and/or check bits and eventually make the entire packet unusable.

Figure 4: Illustration of the end-to-end scheme (encoder and packet buffer in the sender NI, decoder in the receiver NI).


The receiver drops an erroneous packet and triggers the sender with a NACK signal to retransmit this specific packet, which is kept in the packet buffer of the sender Network Interface (NI). An error-free packet is confirmed with an ACK signal, which causes the sender to remove it from the buffer. Implementing exclusive ACK/NACK signals in the e2e scheme is complicated, because it requires complex interconnections between the source and the destination nodes. It is therefore reasonable to utilize the existing NoC architecture instead and send a response packet in a piggyback fashion. These packets implicate non-deterministic timing behavior and eventually increase the overall network load. To address errors affecting the response packets, each source node possesses a time-out mechanism which invokes the retransmission of the packet after a certain amount of time expires. In the e2e detection scheme, faults that occurred during network traversal can only be discovered at the final destination. A localization of the fault position is not possible, because the fault could result from any component on the path. For transient faults, a retransmission using the same route is very likely to succeed. This does not hold for permanent faults, either in switches or links, which always result in the same error. In this situation the retransmitted packet should use a different route that bypasses the erroneous network component, which requires additional knowledge about the fault location. It is therefore desirable to have a fault detection scheme operating at a finer granularity. One proposal is the switch-to-switch (s2s) scheme, which is discussed in the following.
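The sender-side behavior of the e2e scheme can be sketched as follows, combining the packet buffer with ACK/NACK handling and the time-out fallback; it reuses encode_parity from the sketch above, and all class and method names are illustrative assumptions.

```python
class E2ESender:
    """End-to-end sender: keep each encoded packet buffered until it is acknowledged."""

    def __init__(self, network_send):
        self.network_send = network_send   # callable that injects a packet into the NoC
        self.packet_buffer = {}            # packet id -> encoded packet awaiting an ACK

    def send(self, packet_id, payload_bits):
        coded = encode_parity(payload_bits)          # encode_parity from the sketch above
        self.packet_buffer[packet_id] = coded
        self.network_send(packet_id, coded)

    def on_response(self, packet_id, ack):
        if ack:
            self.packet_buffer.pop(packet_id, None)  # delivery confirmed, free the buffer
        else:
            self.network_send(packet_id, self.packet_buffer[packet_id])  # NACK: retransmit

    def on_timeout(self, packet_id):
        # Response packet lost or delayed: retransmit the buffered copy.
        if packet_id in self.packet_buffer:
            self.network_send(packet_id, self.packet_buffer[packet_id])
```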

4.1.2 Switch-to-Switch

In contrast to the e2e scheme, the switch-to-switch (s2s) scheme, proposed in [3], comprises an input error detection circuit and a retransmission buffer in each switch and can operate either at flit or at packet level. Errors can be detected, and erroneous data dropped, at a very early stage, which decreases the time penalty and the switching activity significantly. To recover from an error, it is not necessary to start from the very beginning, because the preceding switch still holds a verified copy in its retransmission buffer, as depicted in figure 5.

Figure 5: Illustration of the switch-to-switch scheme (a decoder and circular queuing/retransmission buffers in each switch; ACK/NACK lines between adjacent switches protected by TMR).

Contrary to the e2e scheme, s2s possesses exclusive ACK/NACK signal lines, which allow a simple implementation, as only adjacent switches have to be connected. Since a stuck-at fault would lead to a permanent communication error, it is advisable to protect these signal lines with a fault tolerant mechanism (e.g. TMR). In a TMR scheme the output remains correct as long as at least two out of three signal lines carry the correct information.
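The two-out-of-three behavior of such a TMR-protected signal line corresponds to a simple majority vote:

```python
def tmr_vote(a, b, c):
    """Majority vote over three replicated signal lines: the result stays correct
    as long as at most one line is faulty."""
    return (a and b) or (a and c) or (b and c)

# One stuck-at-0 line does not corrupt the ACK.
assert tmr_vote(True, True, False) is True
# Two faulty lines exceed what TMR can mask.
assert tmr_vote(True, False, False) is False
```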


The s2s scheme possesses much better fault localization capabilities, as each switch verifies the incoming data on its own. An error detected at a switch input occurred at the preceding switch or at the connecting link; unfortunately, a clear distinction between the two is not possible. For permanent faults the worst-case assumption applies: the switching element is suspected to be faulty, and consequently its adjacent links cannot be used any longer. From the perspective of fault tolerant strategies, this approach is very coarse, since many fully functional network components are removed from the useful set. Beyond that, the adjacent processing element is no longer reachable and as a result cannot provide service any more. In order to differentiate between faults occurring in a switch and faults occurring in an inter-switch link, additional output error detection is necessary. This leads to a code-disjoint error detection scheme, which is discussed in the following section.

4.1.3 Code-Disjoint Detection

In contrast to the s2s scheme, the code-disjoint detection (cdd) scheme, introduced in [2], performs error detection at the input as well as the output of each switch. Code-disjoint routing elements can therefore differentiate between faults occurring in the switches and faults occurring in the inter-switch links. Figure 6 shows a possible implementation of a code-disjoint NoC switch, using a parity-check code as the error detection mechanism. The input parity prediction Pi(Xi) (for clarity only depicted at the west input port) recomputes the current parity of the incoming flit Xi. If an error occurred during transmission, the recomputed parity and the incoming parity Xip differ. As a result, the link error flag (lef) is set, the erroneous flit is dropped immediately (without storing it in the FIFO buffer), and retransmission is initiated by raising the NACK signal. If both parities match, i.e. no error occurred, the incoming flit Xi and the incoming parity Xip are forwarded to the intended output port and become the outgoing flit Xo and outgoing parity Xop. Before the flit finally leaves the NoC switch, recomputation and comparison are performed again at the output, to ensure that no errors were introduced during the routing process. If an error is detected by the output error detection circuit, the switch error flag (sef) is set; the corrupted flit is dropped and a retransmission is invoked.

Pi(Xi): input parity prediction, Po(Xo): output parity prediction, Xi: incoming flit, Xip: incoming parity, Xo: outgoing flit, Xop: outgoing parity

Figure 6: NoC switch with code-disjoint detection.
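A behavioral sketch of the two checks performed by a code-disjoint switch, reusing parity_error from the sketch in section 4.1; the function signature and flag handling are assumptions made for illustration.

```python
def cdd_traverse_switch(incoming_flit_with_parity, route_and_buffer):
    """Check parity at the switch input and again at the output.
    Returns which error flag, if any, has to be raised."""
    if parity_error(incoming_flit_with_parity):
        return "lef"        # link error flag: drop the flit, request retransmission via NACK
    # Fault-free at the input: pass flit and parity through FIFO, crossbar and MUX.
    outgoing_flit_with_parity = route_and_buffer(incoming_flit_with_parity)
    if parity_error(outgoing_flit_with_parity):
        return "sef"        # switch error flag: the switch itself corrupted the flit
    return None             # no error introduced on this hop
```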

NoC switches with the code-disjoint property can detect, and thus drop, erroneous flits at an early stage. The processing effort spent on corrupted data is therefore kept at a minimum, which results in reduced latency and power consumption. Nevertheless, all presented error detection schemes introduce additional costs in the form of hardware (area on the die) and power consumption.


4.2 Comparison and Discussion of Data Fault Detection Methods

This chapter provides a comparison and discussion of the data fault detection methods presented above. The focus is on fault coverage as well as fault localization capabilities, with respect to the implicated overhead in terms of area and power consumption. Since online diagnosis is performed, data faults due to faulty components can obviously only be discovered when these components are actually used.

4.2.1 Fault Coverage Capabilities

The fault coverage for data faults mainly depends on the applied error detection method. In the following, the pros and cons of pure error detection codes are discussed versus detection and correction codes, with respect to the introduced data fault detection methods e2e, s2s, and cdd.

Error Detection and Correction Codes:

In general, error detection codes are not capable of discovering all kinds of errors; they rather address a specific set of errors that are likely to occur in a given application. For instance, the parity-check code can only detect odd numbers of bit errors. Besides the classic parity-check code, other error detection codes such as dual-rail or Cyclic Redundancy Check (CRC) codes are used in NoC applications. They have more sophisticated error detection capabilities, resulting in higher fault coverage, but introduce additional overheads. The most notable concerns are area, power consumption, and latency, due to encoding and decoding as well as additional signal lines carrying the redundant information. According to the preliminary studies in [2, 3], correcting corrupted data with EDC schemes is not well suited for NoC applications if only transient faults are assumed to occur. Error correction requires a lot of power as well as area overhead originating from the information redundancy. Furthermore, it degrades NoC performance, because encoding and decoding are more complicated compared to the simpler error detection codes mentioned above. A much more power- and area-efficient approach is to retransmit the affected data after detection, assuming that the transient error has meanwhile disappeared. Retransmission buffers are necessary in any case, as error detection and correction codes only have limited correction capabilities, as discussed above. When, besides transient faults, permanent faults also occur in the data path, EDC codes become more attractive. In a retransmission scheme, a permanent single bit error caused by a stuck-at fault always leads to the same error; hence the failed network component is detected by the recovery mechanism and bypassed in the future. Utilizing more sophisticated EDC, on the other hand, allows permanent faults to be tolerated and corrected. For example, the (8, 4) Hamming code has single error correction and double error detection (SEC/DED) capabilities. Applying this code to the data path of a NoC allows one permanent error to be tolerated by correcting it. The occurrence of an additional fault is detected and a retransmission is performed; assuming that the second fault is of transient nature, the retransmission attempt will succeed. On the other hand, SEC/DED Hamming codes implicate considerable additional overhead in terms of area, power consumption, and latency.
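To make the SEC/DED behavior concrete, the following sketch implements an extended (8, 4) Hamming code; the bit ordering and helper names are chosen for this example and are not taken from [2, 3].

```python
def hamming84_encode(d):
    """Encode 4 data bits d = [d1, d2, d3, d4] into an 8-bit SEC/DED codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p3, d2, d3, d4]       # Hamming(7,4) in positions 1..7
    return word + [sum(word) % 2]             # overall parity bit for double error detection

def hamming84_decode(w):
    """Return (status, data). status is 'ok', 'corrected' or 'retransmit' (double error)."""
    w = list(w)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]            # checks positions 1, 3, 5, 7
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]            # checks positions 2, 3, 6, 7
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]            # checks positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3
    overall_ok = sum(w) % 2 == 0
    if syndrome and overall_ok:
        return "retransmit", None             # double error: detect only
    if syndrome:                              # single error: correct it in place
        w[syndrome - 1] ^= 1
        return "corrected", [w[2], w[4], w[5], w[6]]
    return "ok", [w[2], w[4], w[5], w[6]]

code = hamming84_encode([1, 0, 1, 1])
code[5] ^= 1                                   # one permanent stuck-at error is corrected
assert hamming84_decode(code) == ("corrected", [1, 0, 1, 1])
code[2] ^= 1                                   # a second error is detected, retransmission needed
assert hamming84_decode(code)[0] == "retransmit"
```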

End-to-End, Switch-to-Switch and Code-Disjoint Detection Schemes:

So far only the coverage capabilities of the pure error detection and correction codes have been considered. The fault coverage also differs depending on the applied data fault detection scheme (e2e, s2s, or cdd), because the number of data verifications and the distances (path lengths) between them vary.


Depending on the applied error detection code, end-to-end data fault detection may result in poorer fault coverage, as the data is encoded at the sender and decoded/verified only at the destination. Depending on the application, e.g. for larger NoC sizes (longer path lengths) or under above-average radiation exposure (more transient faults), the probability that more than one error occurs rises. In end-to-end schemes, parity-check codes might therefore not be appropriate, as the path lengths between verification points are relatively long and even numbers of errors go undetected. It is then advisable to incorporate more sophisticated codes, such as a Cyclic Redundancy Check (CRC) code, in order to achieve better fault coverage. In contrast to the e2e scheme, the s2s scheme validates the data at each switch input. As a result, the path length over which errors might be introduced remains the same, independent of the NoC size: the path between two validation points in s2s consists of only one switch and one link. Therefore it is less likely that more than one error occurs, and the fault coverage using parity-check codes in an s2s scheme is much higher than in an e2e scheme for large NoC sizes. In the s2s scheme, parity checking might be sufficient, and more sophisticated error detection codes such as CRC are not needed. The same holds for the cdd scheme, where data validation is performed when data enters and when it leaves a switch; the path between two validation points is then only either one switch or one link.
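A back-of-the-envelope calculation illustrates the effect of the verification distance on a parity-check code: assuming independent bit flips, the probability that an even (and therefore undetected) number of errors hits one flit grows with the accumulated per-bit error probability along the path. All numbers below are purely illustrative assumptions, not values from [2, 3].

```python
from math import comb

def undetected_by_parity(n_bits, p_bit):
    """Probability that an even, non-zero number of the n_bits covered by one
    parity bit is flipped, assuming independent flips with probability p_bit."""
    return sum(comb(n_bits, k) * p_bit**k * (1 - p_bit)**(n_bits - k)
               for k in range(2, n_bits + 1, 2))

FLIT_BITS = 64          # flit size used in the experiments of [2, 3]
P_BIT = 1e-4            # assumed per-hop bit error probability (illustrative only)

one_hop = undetected_by_parity(FLIT_BITS, P_BIT)          # s2s/cdd verification distance
# Rough model of a six-hop e2e path: the per-bit flip probability accumulates,
# approximated here as 6 * P_BIT (valid for small probabilities).
six_hops = undetected_by_parity(FLIT_BITS, 6 * P_BIT)
print(f"undetected by parity: 1 hop {one_hop:.2e}, 6 hops {six_hops:.2e}")
```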

4.2.2 Fault Localization Capabilities

End-to-End, Switch-to-Switch and Code-Disjoint Detection Schemes:

In this section the fault localization capabilities of the e2e, s2s, and cdd schemes are compared. For this purpose a 4x4 NoC with two fault locations is assumed: firstly a failed switch Sw1, and secondly a failed inter-switch link L5, as depicted in figure 7. Sending data along paths I and II results in errors that are identified by all detection schemes, but at different points in time.

S: Source switch, Sw: Switch, D: Destination switch, L: Link

Figure 7: 4x4 NoC with a switch and a link fault (faulty switch Sw1 on path I from S1 to D1, faulty link L5 on path II from S2 to D2).

Depending on how fine-grained the applied data fault detection scheme works, different sets of suspicious fault locations are obtained [2], as shown in table 6.

Path I (S1 to D1):
e2e = {S1, L1, Sw1, L2, Sw2, L3, D1}
s2s = {Sw1, L2}
cdd = {Sw1}

Path II (S2 to D2):
e2e = {S2, L4, Sw3, L5, Sw4, L6, Sw5, L7, Sw6, L8, Sw7, L9, D2}
s2s = {Sw3, L5}
cdd = {L5}

Table 6: Set of network components suspected of producing faults.


From the results of the e2e scheme one can infer that the number of components suspected to be defective rises with increasing path length. It can therefore be concluded that this scheme is not suited for very large mesh sizes; an unambiguous localization is not possible. The results of the s2s and cdd schemes show that the fault localization capabilities of these approaches are independent of the path length. Using the s2s scheme, the most recently traversed switch or the inter-switch link can be isolated as suspicious; in case of permanent faults, the switch and its adjacent links are marked as defective. Only the code-disjoint detection scheme allows a clear fault localization, but at the highest hardware overhead.

High Level Online Fault Diagnosis:

Another possibility to reduce the number of suspicious components, e.g. in the end-to-end scheme, is the high level online fault diagnosis method proposed in [4], which works on "historical data". Online diagnosis maintains a list of network components suspected of producing faults. Every time a packet reaches its destination, error detection codes verify whether a data fault occurred during transmission, and the list is updated as follows: the network components (switches and links) a packet passed through are either added to or removed from the set of suspicious components, depending on whether an error occurred during transmission or not. To exemplify how this online diagnosis method works, reconsider the previous example of the 4x4 NoC depicted in figure 7. As already discussed, sending a packet along path II using the e2e scheme results in a data corruption. This corruption can be discovered at the earliest at the destination D2. As the exact fault location cannot be identified, the entire path {S2, L4, Sw3, L5, Sw4, L6, Sw5, L7, Sw6, L8, Sw7, L9, D2} is added to the list of suspicious components. Since this was the first packet that passed the network, path II corresponds to the set of suspicious components. Subsequently, another packet is sent from Sw4 to D2. Because this path is free of faulty components, the packet reaches its destination as expected. As a result, the components between this sender and the destination are removed from the set of suspicious components, because they are verified to be fault-free, which reduces the set to {S2, L4, Sw3, L5}. Another packet sent from S2 to Sw3 traverses the path {S2, L4, Sw3}. As this transmission is also fault-free, these components are removed from the list as well. As a result, the faulty component, link L5, which was originally responsible for the corruption of the first packet, is unambiguously identified. Admittedly, this example was constructed such that the faulty network component L5 can easily be identified by means of just three packets. During normal operation, this online diagnosis method requires many more packets as well as more time to clearly identify a faulty component. It can be inferred that the higher the diversity of the paths packets take through the network, the higher the fault localization capability of this online fault diagnosis method. The method works quite well for smaller NoC sizes; in larger networks, on the other hand, it might be complicated to keep track of all resources as well as all packets in the network.
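This bookkeeping can be sketched compactly; the replayed packets correspond to the three transmissions of the example, and the representation of components as a Python set of names is an assumption made for illustration.

```python
class HistoryDiagnosis:
    """High level online diagnosis: narrow down faulty components from transmission outcomes."""

    def __init__(self):
        self.suspicious = set()

    def report(self, path, error_detected):
        if error_detected:
            self.suspicious |= set(path)     # every component on the path is a suspect
        else:
            self.suspicious -= set(path)     # a fault-free transmission exonerates its path

diag = HistoryDiagnosis()
path_ii = ["S2", "L4", "Sw3", "L5", "Sw4", "L6", "Sw5", "L7", "Sw6", "L8", "Sw7", "L9", "D2"]
diag.report(path_ii, error_detected=True)                       # corrupted packet along path II
diag.report(["Sw4", "L6", "Sw5", "L7", "Sw6", "L8", "Sw7", "L9", "D2"], error_detected=False)
diag.report(["S2", "L4", "Sw3"], error_detected=False)
assert diag.suspicious == {"L5"}                                 # the faulty link is isolated
```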


4.2.3 Implicated Overheads

End-to-End, Switch-to-Switch and Code-Disjoint Detection Schemes:

This comparison considers the previously introduced data fault detection schemes e2e, s2s, and cdd with respect to power consumption as well as area overhead. It is mainly based on the experimental results presented in [2], which propose cdd as a new remedy for data faults. These results are in fact an extension of the results published in [3], which weigh the pros and cons of e2e and flit-level switch-to-switch (s2sf) detection with respect to power consumption, latency, and area overhead. Both experiments use a 4x4 mesh-based NoC incorporating 16 switches as well as 16 PEs. Error detection is performed by means of parity-check codes. A uniform traffic pattern with an injection rate of 0.1 flits/cycle per PE is applied, i.e. each PE dispatches messages with the same probability. One message consists of 4 flits; the flit size equals 64 bits. Further details concerning the simulation conditions can be found in [2, 3].
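As a toy illustration of this kind of check, the following snippet appends a single even-parity bit to a 64-bit flit and verifies it at the receiving side; the exact code layout used in [2, 3] is not specified here, so this is only an assumption-level sketch:

def add_parity(flit, width=64):
    """Append one even-parity bit to a `width`-bit flit (returns a 65-bit code word)."""
    parity = bin(flit & ((1 << width) - 1)).count('1') & 1
    return (flit << 1) | parity

def check_parity(codeword):
    """True if the received code word still has even parity."""
    return bin(codeword).count('1') % 2 == 0

cw = add_parity(0xDEAD_BEEF_CAFE_F00D)
assert check_parity(cw)                    # intact flit passes the check
assert not check_parity(cw ^ (1 << 13))    # any single bit-flip is detected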

Power Consumption:

Regarding the power consumption overhead caused by the additional error detection mechanisms, different sources can be identified depending on the applied data fault detection scheme (e2e, s2s, or cdd). A power overhead common to all schemes is caused by the extra bits added for error detection. In the end-to-end scheme, one source of extra power consumption is the packet buffer incorporated at the sending node, which is needed for retransmission. Additional energy-consuming circuits for encoding and decoding are incorporated at the source and destination. Data corruption introduced during transmission can only be identified at the destination; as a result, retransmission has to be performed all the way from the source to the destination, resulting in the maximum penalty in terms of power consumption. In the s2s scheme, retransmission buffers are incorporated in each switching element, meaning that more retransmission buffers are needed, which results in higher power consumption compared to the e2e scheme. The circuit for encoding is incorporated at the source node; furthermore, checking circuits (decoders) are needed at each switch input. On the other hand, the power consumption penalty due to retransmission is much smaller, since retransmission is performed from the last switch and not from the very beginning. The code-disjoint detection scheme differs regarding power consumption in just one point: besides error checking at the switch input, cdd also performs error checking at the switch output.

[Plot: power consumption (mW, ca. 60 to 160) versus flit-error rate (%, 0.001 to 4) for the e2e, s2sf, and cdd schemes]

Figure 8: Power consumption of e2e, s2sf, and cdd schemes [2].

Figure 8 depicts the results presented in [2] regarding the power consumption of the discussed error recovery schemes as a function of the flit-error rate. The flit-error rate is defined as the percentage of flits that contain one or more errors. As expected, e2e data fault detection has the lowest power consumption for very small flit-error rates. One reason is that e2e requires the minimum hardware overhead, in other words, the fewest buffers


and checking circuits needed to implement error detection and correction capabilities. Assuming that errors occur only rarely, expensive, power-consuming retransmissions are seldom. With rising flit-error rates, power consumption due to retransmissions gains in importance. These expectations correlate with the results presented in figure 8. Regarding the results determined for the flit-level switch-to-switch (s2sf) scheme, it is not surprising that s2s in general consumes more power. This overhead mainly results from the retransmission buffers incorporated in every switch; in addition, more error checking circuits are needed, which contributes to the higher power consumption compared to e2e. For higher flit-error rates the difference between s2sf and e2e decreases, as retransmission in s2sf is not performed from the very beginning. The NoC size in this example is just 4x4; for larger network structures it can be inferred that the power overhead difference between e2e and s2sf might become even smaller, since the retransmission power penalty of e2e rises with the longer path lengths. Surprisingly, the power consumption overhead for the cdd scheme published in [2] is much smaller than expected. The authors of [2] explain these results for their proposed code-disjoint detection as follows: "This is because though it theoretically requires the same amount of retransmission buffers, not all of them are used effectively when an error is detected (if an error is detected at a switch input, the buffers of that switch are not used)" [2]. This may explain why the cdd scheme could operate more power-efficiently than the s2sf scheme for extremely high flit-error rates; whether these simulations correlate with real-world behavior is questionable. Moreover, flit-error rates of up to 4%, as shown in [2], are far beyond the error rates expected during normal operation; the original (and more cautious) investigations in [3] provide simulation results for the e2e and s2sf schemes only for flit-error rates up to 1%. Assuming that no errors occur in the network (the flit-error rate approaches zero), the power consumption of cdd and s2sf should be equal, because then all retransmission buffers are in use in both schemes. Furthermore, the total power consumption of cdd should even be higher than that of s2sf, because cdd requires more error detection circuits. According to the simulation results in [2], however, the power consumption of cdd for small flit-error rates is much lower than that of s2sf and, beyond that, even very close to e2e. To sum up, in light of the previous discussion the simulation results for cdd are not comprehensible; apart from the citation given above, the authors of [2] do not provide any additional explanation for this phenomenon.
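The qualitative trend for e2e versus s2sf can be sketched with a deliberately crude first-order model of the retransmission traffic per delivered flit (at most one error per flit, buffer and checker energy ignored); the numbers below are purely illustrative and are not the values simulated in [2, 3]:

def retransmission_link_traversals(flit_error_rate, path_hops, scheme):
    """Expected extra link traversals per delivered flit due to retransmission."""
    if scheme == 'e2e':
        # the error is only noticed at the destination: resend over the whole path
        return flit_error_rate * path_hops
    if scheme == 's2sf':
        # the error is noticed at the next switch: resend over a single hop
        return flit_error_rate * 1
    raise ValueError(scheme)

for p in (0.001, 0.01, 0.04):                      # flit-error rates as in Figure 8
    e2e = retransmission_link_traversals(p, path_hops=6, scheme='e2e')
    s2sf = retransmission_link_traversals(p, path_hops=6, scheme='s2sf')
    print(f"flit-error rate {p:.1%}: e2e {e2e:.4f}, s2sf {s2sf:.4f} extra traversals")

In this toy model the retransmission cost of e2e grows with the path length while that of s2sf does not, which is consistent with the narrowing gap between the two schemes at higher flit-error rates in Figure 8.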

Area Overhead:

The last comparison between e2e, s2s, and cdd focuses on the area overhead implied by adding error detection and correction mechanisms to an existing NoC. Table 7 lists the area needed to embed fault detection capabilities. The scheme "original" serves as the reference architecture (4x4 NoC: 16 switches with 16 PEs) without error detection and correction capabilities [3].

Scheme     Area (mm²)
Original   3.36
e2e        4.40
s2sf       5.76

Table 7: Area comparison of the original architecture, e2e, and s2sf [3].

As anticipated, the area overhead introduced by s2sf is much higher than that of e2e, since the s2sf scheme requires more retransmission buffers as well as more error detection circuits. As cdd requires roughly twice as many error detection circuits as s2sf, it can be assumed that its resulting area overhead is even higher. A numerical value cannot be provided, because the authors of [2] did not evaluate the area overhead implied by adding their error detection and correction mechanism.
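In relative terms, the figures of Table 7 correspond to roughly 31% extra area for e2e and 71% for s2sf, as the following short computation shows (cdd is omitted because no area value is reported in [2]):

# Relative area overhead of the schemes reported in Table 7.
areas = {'original': 3.36, 'e2e': 4.40, 's2sf': 5.76}   # in mm^2, taken from [3]

for scheme in ('e2e', 's2sf'):
    overhead = (areas[scheme] - areas['original']) / areas['original']
    print(f"{scheme}: +{overhead:.0%} area compared to the original NoC")
# e2e:  +31% area compared to the original NoC
# s2sf: +71% area compared to the original NoC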


5 Conclusions

Online diagnosis of Networks-on-Chip is gaining importance because the number of cores per chip will continue to increase in the future. Therefore, this paper presented different online fault detection and localization methods proposed for the design of reliable Networks-on-Chip in the Many-Core Era. First, the components that compose a generic NoC switch were illustrated, and it was discussed how defects in these components influence the overall behavior of a switch. Fault models serve here as a viable abstraction tool that hides detailed implementation information. Subsequently, these fault models were further classified into either control or data faults, as each of them requires different remedies to cope with its specific kinds of faults. Different control fault detection methods, namely distraction detection, switch count, time-out detection, trapped packet detection, and sequence numbers, were presented. Their fault coverage and localization capabilities were then discussed, as well as the implicated overheads. The most powerful method for coping with control faults is distraction detection, with respect to fault coverage and localization as well as area overhead. This method serves as a basis and can be extended by other control fault detection and localization methods in order to improve fault coverage and localization capabilities. Finally, faults occurring in the data path of a switch were addressed. It was shown how this kind of fault can be handled by the data fault detection methods end-to-end, switch-to-switch, and code-disjoint detection. A comparison then discussed fault coverage and fault localization capabilities with respect to the applied error detection and correction mechanisms, and the overheads (area and power consumption) implied by embedding data fault detection methods into existing NoC structures were examined. It turned out that code-disjoint detection has the most sophisticated fault localization capabilities, but at the highest overheads. When applying fault detection and localization methods to NoCs, one has to trade off good fault coverage and localization capabilities against an area- and cost-efficient design. These trade-offs have to be adjusted according to the underlying application.

References

[1] Armin Alaghi, Naghmeh Karimi, Mahshid Sedghi, and Zainalabedin Navabi. "Online NoC switch fault detection and diagnosis using a high level fault model," in Proceedings of the 22nd IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT), 2007, pp. 21-29.

[2] Cristian Grecu, André Ivanov, Res Saleh, Egor S. Sogomonyan, and Partha Pratim Pande. "On-line fault detection and location for NoC interconnects," in Proceedings of the 12th IEEE International On-Line Testing Symposium (IOLTS), 2006, pp. 145-150.

[3] Srinivasan Murali, Giovanni De Micheli, Luca Benini, Theocharis Theocharides, N. Vijaykrishnan, and Mary Jane Irwin. “Analysis of error recovery schemes for networks on chips,” IEEE Design & Test of Computers, vol. 22, no. 5, pp. 434-442, Oct. 2005.

[4] Naghmeh Karimi, Armin Alaghi, Mahshid Sedghi, and Zainalabedin Navabi. "Online network-on-chip switch fault detection and diagnosis using functional switch faults," Journal of Universal Computer Science, vol. 14, no. 22, Dec. 2008.

[5] Luca Benini and Giovanni De Micheli. “Fault models for NoCs,” in Networks on Chips: Technology and Tools, vol. 1, Charles B. Glaser, Ed. San Francisco: Morgan Kaufmann, pp. 81-84, 2006.