40
Graceful Deadlock-Free Fault-Tolerant Routing Algorithm for 3D Network-on-Chip Architectures Akram Ben Ahmed, Abderazek Ben Abdallah The University of Aizu, Graduate School of Computer Science and Engineering, Adaptive Systems Laboratory, Fukushima-ken, Aizu-Wakamatsu-shi 965-8580, Japan E-mail: {d8141104, benab}@u-aizu.ac.jp Abstract Three-Dimensional Networks-on-Chip (3D-NoC) has been presented as an auspicious solution merging the high parallelism of Network-on-Chip (NoC) interconnect paradigm with the high-performance and lower interconnect-power of 3-dimensional integration circuits. However, 3D-NoC systems are exposed to a variety of manufacturing and design factors making them vulnerable to dierent faults that cause corrupted message transfer or even catastrophic system failures. Therefore, a 3D-NoC system should be fault-tolerant to transient malfunctions or permanent physical damages. In this paper, we present an ecient fault-tolerant routing algorithm, called Hybrid-Look- Ahead-Fault-Tolerant (HLAFT), which takes advantage of both local and look-ahead routing to boost the performance of 3D-NoC systems while ensuring fault-tolerance. A deadlock- recovery technique associated with HLAFT, named Random-Access-Buer (RAB), is also presented. RAB takes advantage of look-ahead routing to detect and remove deadlock with no considerably additional hardware complexity. We implemented the proposed algorithm 1

Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

for 3D Network-on-Chip Architectures

Akram Ben Ahmed, Abderazek Ben Abdallah

The University of Aizu,

Graduate School of Computer Science and Engineering,

Adaptive Systems Laboratory,

Fukushima-ken, Aizu-Wakamatsu-shi 965-8580, Japan

E-mail: {d8141104, benab}@u-aizu.ac.jp

Abstract

Three-Dimensional Networks-on-Chip (3D-NoC) has been presented as an auspicious

solution merging the high parallelism of Network-on-Chip (NoC) interconnect paradigm with

the high-performance and lower interconnect-power of 3-dimensional integration circuits.

However, 3D-NoC systems are exposed to a variety of manufacturing and design factors

making them vulnerable to different faults that cause corrupted message transfer or even

catastrophic system failures. Therefore, a 3D-NoC system should be fault-tolerant to transient

malfunctions or permanent physical damages.

In this paper, we present an efficient fault-tolerant routing algorithm, called Hybrid-Look-

Ahead-Fault-Tolerant (HLAFT), which takes advantage of both local and look-ahead routing

to boost the performance of 3D-NoC systems while ensuring fault-tolerance. A deadlock-

recovery technique associated with HLAFT, named Random-Access-Buffer (RAB), is also

presented. RAB takes advantage of look-ahead routing to detect and remove deadlock with

no considerably additional hardware complexity. We implemented the proposed algorithm

1

Page 2: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

and deadlock-recovery technique on a real 3D-NoC architecture (3D-OASIS-NoC1) and

prototyped it on FPGA. Evaluation results show that the proposed algorithm performs better

than XYZ, even when considering high fault-rates (i.e., ≥ 20%), and outperforms our

previously designed Look-Ahead-Fault-Tolerant routing (LAFT) demonstrated in latency/flit

reduction that can reach 12.5% and a throughput enhancement reaching 11.8% in addition

to 7.2% dynamic-power saving thanks to the Power-management module integrated with

HLAFT.

Keywords: 3D-NoC ; Architecture ; Parallel ; Adaptive ; Look-ahead routing ; Deadlock-free;

1 Introduction

During the past few decades, technology enabled the aggressive scaling and continuous shrinkage

of transistors dimension on modern microchips. This made the integration of billions of transistors

on a single chip achievable [1]. As an example, the Intel Xeon processor [2] includes 2.3 billion

transistors. With such high integration level available, the development of many cores on a single

die has become possible. For instance, the Tilera Tile64 [3] and Intel Polaris [4] contain 64 and

80 cores, respectively. As the number of cores keeps increasing, and in order to take advantage

of the large number of cores, the employment of efficient and scalable interconnect fabrics has

become imperative. Traditional on-chip interconnect schemes, such as Point-to-Point, shared-

bus, and the crossbar, are no longer reliable to provide the necessary communication among the

processor cores. Recently, Network-on-Chip (NoC) [5, 6] has been proposed as a potential solution

to respond to the high interconnection demands arising in the nanometer era.

Based on a simple and scalable architecture platform, NoC connects processors, memories and

other custom designs together using switches to distribute packets on a hop-by-hop basis in order

to increase the bandwidth and performance and solve the interconnect bottleneck in traditional

1This project is partially supported by Competitive research funding, Ref. P1-5, Fukushima, Japan.

2

Page 3: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

bus-based systems. On the other hand, as the number of cores keeps increasing in hyper-core

systems, spreading hundreds of cores in a 2D plane may not satisfy the high requirements of

future data-intensive applications. This is mainly caused by the increasing diameter of 2D-NoC

that does not scale with such large number of cores.

At the same time, three dimensional integrated circuits (3D-ICs) [7] have attracted a lot of

attention as a potential solution to resolve also the interconnect bottleneck. Thanks to the reduced

average interconnect length, 3D-ICs can achieve higher-performance and a lower interconnect-

power consumption can be obtained [8]. Moreover, the realization of mixed technology has

become possible [9]. Combining the NoC structure with the benefits of the 3D integration offers

a promising 3D-NoC architecture. The combination of these two major trends provides a new

horizon for NoC designs to satisfy the high requirements of future large-scale applications.

As 3D-NoC architectures started to show their outperformance and energy efficiency against

2D-NoC systems, questions about their reliability to sustain their performance growth begun to

arise [10]. This is mainly due to challenges inherited from both 3D-ICs and NoCs: On one side,

the complex nature of 3D-IC fabrics and the continuing shrinkage of semiconductor components.

Furthermore, the significant heterogeneity in 3D chips which are likely to mix logic layers with

memory layers and even more complex technologies increases the fault’s probability in a system

[11]. On the other side, the single-point-failure nature of NoC introduces a big concern to their

reliability as they are the sole communication medium. As a result, 3D-NoC systems are becoming

susceptible to a variety of faults caused by crosstalk, electromagnetic interferences, impact of

radiations, oxide breakdown, and so on [12]. A simple failure in a single transistor caused by one

of these factors may compromise the entire system reliability where the failure can be illustrated

in corrupted message delivery, time requirements unsatisfactory, or even sometimes the entire

system collapse. Faults can occur at any component of a 3D-NoC system (i.e., link, router, buffers,

crossbar etc.) and their rate of occurrence depends on the design, technology, environment and

3

Page 4: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

operation conditions.

From a time perspective, the duration of faults is very important especially for real-time 3D-

NoC systems and it can be categorized into three main types [13]: 1) Transient faults: they occur

and remain in the system for a particular period of time before disappearing; 2) Intermittent

faults: they are transient faults which occur from time to time; 3) Permanent faults: they start

at a particular time and remain in the system until they are repaired. Consequently, reliable

3D-NoC systems should be able to detect first all the fault types occurrence, then working on

reconfiguring the system resources to recover from these faults and guarantee the continuous

correct functionality of the system.

Fault detection is very crucial phase in fault-tolerant 3D-NoC systems. It can be obtained

by relying on custom testing mechanisms or other detection schemes based on codes. Codes

are mostly used in NoC systems and they were proposed to detect and correct errors in specific

components of the system at the presence of a specific type of fault. For instance, Crosstalk

Avoidance Codes (CAC) [14] are used for transmission wires and they are considered more

efficient than the already existing methods to avoid crosstalk. For errors whose presence could

not be detected, Error Detection Codes (EDC) and Error Correcting Codes (ECC) [15] are used

to detect and correct these errors.

Detection and checking mechanisms in 3D-NoC systems are out of the scoop of this paper and

we are mainly interested in avoiding the detected faults by using a light-weight adaptive routing

to avoid information loss or corruption. On the other hand, deadlock is one of the problems

that may come with adaptive routing. Deadlock is caused when packets in different buffers are

unable to progress because they are dependent on each other forming a dependency cycle. Most of

the existing 3D-NoC systems used Virtual-Channel (VC) [16] as a deadlock-avoidance technique

while others employ Virtual-Output-Queue (VOQ) [17] or their routing algorithms are based on a

turn-model that adds restrictions to the routing choices to avoid deadlock. These procedures insure

4

Page 5: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

the deadlock freedom at the expense of the hardware and implementation complexity or latency

penalty caused by the routing restrictions.

Based on these facts, in this paper we propose a novel low-latency, high-throughput and

fault-tolerant routing algorithm, that we called Hybrid-Look-Ahead Fault-Tolerant (HLAFT).

HLAFT takes advantages of the high-throughput and low-overhead of our previously implemented

routing algorithm, Look-Ahead-Fault-Tolerant (LAFT) [18, 19], and combines it with local-

routing for better routing decision while simultaneously guaranteeing fault tolerance with graceful

performance degradation. The proposed algorithm can detect if the route calculated for the next

node leads to a blocking path and recalculate the route to save the unnecessary clock cycles

needed for nonminimal routing. To ensure deadlock-freedom, we employed a low-cost technique

called Random-Access-buffer (RAB). RAB detects first the deadlock occurrence then manages

to drop the blocking request and looks for other ones to free some slots in the buffer and break

the dependency cycle causing the deadlock. Both HLAFT and RAB were implemented on 3D-

OASIS-NoC [20, 21] architecture which is a natural extension of our previous robust and flexible

2D-OASIS-NoC system [6, 23]. We prototyped the proposed system on FPGA and evaluated its

hardware complexity and performance under different fault-rates and traffic loads using Transpose

and Uniform synthetic workloads and also Matrix-multiplication and JPEG-encoder applications.

The rest of the paper is organized as follows: In Section 2, we present some of the related

work to fault-tolerant routing algorithms used in most popular 2D- and 3D-NoC systems. The

proposed Hybrid-Look-Ahead-Fault-Tolerant routing algorithm (HLAFT) and its main features

are explained in Section 3. In Section 4, we present Random-Access-Buffer technique for

deadlock-recovery. Section 5 is dedicated for the evaluation methodology and results, and finally

we end the paper with the conclusion and future work in Section 6.

5

Page 6: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

2 Related Work

Many works have been conducted so far to tackle fault-tolerance in NoC systems where they can

be classified depending the target system, the fault’s type, or the faults’ handling mechanism (e.g.,

using routing algorithms or architectural solutions). The majority of the fault-tolerant solutions

were proposed for 2D-NoC systems. Some of them added restrictions to the number of faults as a

security requirement for their systems. For instance, a single link or single node failure tolerance

was presented in [24]. For n-dimensional mesh, Duato et al. [25] presented a routing algorithm

that can tolerate up to n-1 faults.

Another part of the proposed solutions eliminated the number of faults restriction and, instead,

they limited the location of faults to a specific part of the network. They called these restricted

subnetworks fault-regions which can be disabled, if necessary, to ensure fault-tolerance of the

system. These restrictions can be represented by the shape of the fault-regions [26], their locations

(excluding faults located at the edges of the network [27]), or assuming the absence of faults in

some specific parts of the router [28]).

Some works were presented without adding any restriction to either the number of faults or

location. For example, uLBDR [29] is a routing scheme for 2D mesh topology based on Virtual-

cut-through which incurs a high complexity despite its great performance. The work in [30] is

based on stochastic approaches to tolerate transient and permanent faults.

Other works [31, 32] focused on the importance of adopting minimal routing algorithms to

reduce the congestion caused by the presence of faults. They proposed minimal routing schemes

that are able to adaptively route packets through the shortest paths, as long as a path exists.

Authors in [33], presented a comprehensive investigation about 2D/3D fault-tolerant routing

algorithms’ implementation. They discussed the different turn models, fault conditions, and the

fault occurrence in both links and routers.

6

Page 7: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

Some of the proposed fault-tolerant solutions for 3D-NoC systems primarily focused on

permanent faults occurring in Through-Silicon-Vias (TSVs). For example, Loi et al. [34]

addressed the TSV yield improvement by presenting a solution based on spare via insertion.

Pasricha et al. [35] introduced serialization to achieve also the same goal. On-chip sequential

elements protection techniques were also proposed in [36, 37] to deal with single event upsets.

The other existing works for 3D-NoC systems focused on link failure in general by adopting

fault-tolerant routing algorithms. For example, Rahmani et al. [38] presented a fault-tolerant

routing algorithm for named AdaptiveZ targeted for Hybrid-3D-NoC architecture. Feng et al.

[41] proposed a low overhead deflection routing algorithm based on routing tables. However, the

deployment of routing tables is still costly in terms of hardware and suffers from poor scalability

due to the area required for the tables.

Nordbotten et al. [39] presented an adaptive and fault-tolerant routing for 3D meshes and tori

based on storing at each node a table for routing containing information about destination and the

list of intermediate nodes. A fault-tolerant routing algorithm for 3D torus was presented by Yamin

et al. [40]. With up to 30% faulty-node rate, the proposed algorithm can find a valid link between

two faulty nodes with a probability higher than 90%. The Planar Adaptive Routing (PAR) for large

scale 3D topologies was presented by Chien et al. [42] where they used three Virtual-Channels

(VCs) [16] for deadlock-recovery and used also some routing rules. In [43] a fault model based

on PAR, named 3D minimum-connected component (MCC), was presented. Later, Xiang et al.

[44] proposed a fault model, named planar network (PN), that needs to store much less safety

information at each faulty-free node.

The problem with the solutions mentioned above is that they use VCs which are costly in terms

of hardware and implementation complexity. This is caused by the arbitration needed to handle

the different requests coming from the multiple VCs at each input-port. Thus, they introduce an

extremely expensive hardware complexity and implementation cost besides the energy overhead

7

Page 8: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

making them undesirable for the 3D-NoC implementation.

To avoid using VCs while insuring deadlock-freedom, Pasricha et al. [45] extended a 2D

turn-model for partially adaptive routing to the third dimension. The proposed scheme combines

both 4N-FIRST and 4P-FIRST schemes to propose a lightweight 4NP-FIRST. On the other hand,

this turn-model introduces some routing restriction to prevent from deadlock. These restrictions

cause a nonminimal routing selection where in some cases it may take too many additional hops

for the packet to reach its destination.

A recent work by Akbari et al. [46] focused mainly on faults links and proposed an algorithm

named AFRA. This algorithm routes packets through ZXY in the absence of link faults. When

faults are detected, flits are sent first to an escape node through X, and then they continue to

their destination through ZXY as it is in the no-fault case. The authors presented simplicity,

good performance and robustness as the main features of this algorithm. However, this solution

has two major drawbacks: First, AFRA tolerates only vertical links and horizontal links are not

taken into consideration; second, the network congestion status is not taken into consideration,

especially when more than one possible minimal path is available. In that case, a static minimal

path selection does not eventually lead to low latency (i.e. a minimal path may contain a high

congestion while other alternative ones do not).

Ebrahimi et al. [47] stated that AFRA requires a fault distribution mechanism to know

about the fault information on all vertical links along each row. They proposed a more reliable

routing algorithm named HamFA which does not require any fault distribution, additional fault

information, or any virtual channels. They modified the basic form of the Hamiltonian path in

order to be able to switch between high and low channel subnetworks. In fact, HamFA provides

much more reliability than AFRA; however, it requires that the faulty links should belong to the

same subnetwork in order to tolerate multiple vertical or horizontal one-faulty links. As a result,

this limits the reliability of this approach to 95% when considering the presence of a single faulty

8

Page 9: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

link. In our approach, no restrictions are added to the fault locations and they can be anywhere in

the network as long as there exists a valid path between a given (source, destination) pair.

3 Hybrid-Look-Ahead-Fault-Tolerant Routing Algorithm

3.1 Look-Ahead-Fault-Tolerant routing (LAFT) drawbacks

To keep the benefits of look-ahead routing [22], Look-Ahead-Fault-Tolerant routing algorithm

(LAFT) [18, 19] should be able to perform the routing decision for the next node taking into

consideration its link status and selects the best minimal path. The fault information is read by

each input-port where LAFT handles the routing computation. The first phase of this algorithm

calculates the next node address depending on the next-port identifier read from the flit. For a

given node wishing to send a flit to a given destination, there exist at most three possible directions

through X, Y, and Z dimensions respectively. In the second phase, LAFT performs the calculation

of these three directions by comparing x, y and z coordinates of both current and destination

nodes concurrently. At the same time, as these directions are being computed, the fault-control

module reads the next-port identifier from the flit and sends the appropriate fault information to the

corresponding input-port. By the end of this second phase, LAFT has information about the next

node fault status and also the three possible directions for a minimal routing. In the next phase,

the routing selection is performed. For this decision, we adopted a set of prioritized conditions

to ensure fault-tolerance and high performance either in the presence or absence of faults: 1)

The selected direction should ensure a minimal path and it is given the highest priority in the

routing selection, 2) we should select the direction with the largest next hope path diversity, and

the congestion status is given the lowest priority.

To understand better how LAFT works, we observe Fig. 1 (a). Assuming that the current node

(labeled C) received an incoming flit where the next port identifier, calculated in the previous node,

9

Page 10: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

indicates that the out-port for this flit is East (Red arrow). The next node address is calculated

(labeled N). Three minimal directions are possible for routing: East, North or Up. The East

direction will not be selected since the link in this direction is faulty. Therefore, either North or

Up can be selected, which both are minimal and nonfaulty. In this case, the diversity priority [18]

is taken into consideration. If Up is selected, where the node in this direction is on one of the

network edges, the diversity value is equal to 2 (two minimal possible directions: East or North).

But, if North is selected, its diversity value is equal to 3 (East, North or Up). Having the highest

priority, the North out-port (Green arrow) is selected for the next node and it is embedded in the

flit to be used in the downstream node to allow the routing calculation and switch allocation to be

performed both in parallel. More details about the different steps of the algorithm can be found in

[18].

Employing look-ahead routing reduces the router latency and improves the system overall

performance. However, due to the limited information restricted to only one switch ahead, in

some cases LAFT may not select the best route. Observing the example in Fig. 1 (a), we can see

that the North route is leading to a blocked path where all the minimal possible directions to the

destination are faulty. Therefore, nonminimal routing is required causing several additional clock

cycles. When arriving to the next node (labeled N), the out-port is already decided and sent to the

Switch-allocator due to the look-ahead routing restraints. This is despite the fact that this path is

leading to a blocking path and also a better blocking-free route exists (Up). Therefore, optimizing

LAFT and making it able to detect whether the route precalculated in the previous upstream node

would lead to a blocked path or not is necessary. To enable this, the benefits of look-ahead routing

should be combined with local routing for better routing selection.

10

Page 11: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

3.2 Proposed Hybrid-Look-Ahead-Fault-Tolerant routing (HLAFT) algorithm

The proposed Hybrid-Look-Ahead-Fault-Tolerant routing algorithm (HLAFT) solves the earlier

mentioned problems of LAFT. At every incoming flit, HLAFT makes a simple computation to

judge whether the precalculated Next-port identifier will lead to a blocking path or not. In case

where a possible nonminimal route might occur, HLAFT recomputes the route depending on the

local and neighboring nodes fault status. Algorithm 1 represents the proposed HLAFT routing.

At first, the fault-control module at each input-port reads the Next-port identifier and destination

addresses from the flit. Depending on the Next-port value, the next-node can be calculated. After

acquiring information about the next-node, the three possible minimal directions are computed

and checked whether a valid minimal route leading to the destination exists or not. In case where

at least a valid minimal link leading to the destination exits, the Next-port identifier is sent to the

Switch-allocator and the look-ahead-routing-computation (LA-RC) modules at the same time. In

the opposite case (i.e., all the possible three directions are faulty), a flag is issued triggering the

local-routing-calculation module (Loc-RC) to recalculate the output-port by verifying the status of

the current and neighboring nodes links. The same three priorities, earlier explained in the previous

subsection (i.e., minimal valid, higher diversity, and less congested), are taken into consideration

while selecting the new out-port. The recalculated out-port issued from Loc-RC is sent to the

Switch-allocator to perform the arbitration, and also sent to the LA-RC module to compute the

Next-port for the next-node to continue the look-ahead routing cycle.

As illustrated in Fig. 1 (b), when the flit arrives to the node labeled C and having a current out-

port North precalculated in the previous node, the fault-control checks the fault status of the links

of the next-node in the north direction. When performing the checking, all the possible minimal

paths are found faulty; thus, Loc-RC is triggered and computes Up as the new best route (since

East is faulty). Then, the newly calculated Up direction is sent to the LA-RC module which issues

North as the Next-out-port.

11

Page 12: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

Figure 2 (a) depicts the LAFT pipeline stages which takes advantage of look-ahead routing

to parallelize both Next-Port-Calculation (NPC) and Switch-Arbitration (SA) stages. For the

proposed HLAFT algorithm, illustrated in Fig. 2 (b), we can observe first that a pipeline stage

is dedicated for the fault-control module (FTC) to issue the flag and whether trigger local-routing

or not. This simple computation is done in parallel with the Buffer Writing stage (BW) to further

reduce the computation latency in the router without creating any critical paths. In case where no

flag is issued, the Local-Routing-Calculation stage (Local-RC) is bypassed and then Next-Port-

Calculation and Switch-Arbitration stages are performed in parallel (NPC/SA). In the presence

of a flag, Local-RC is performed and then the results are sent to NPC/SA stage before the final

Crossbar Traversal stage (CT) handles the flit transfer to the next neighboring node.

4 Random-Access-Buffer Scheme for Deadlock-Recovery

As is the case for every adaptive routing algorithm, the deadlock issue may rise. As we previously

mentioned in Section 1, most of the existing routing algorithms use either Virtual-channels (VCs)

or add restrictions to the routing selection to avoid deadlock. These solutions either suffer from

high implementation complexity or incur an additional delay due to the nonminimal approach. In

our case, we implemented a similar technique to VCs, but it is much simpler and less complex.

This technique, named Random-Access-Buffer (RAB), detects first the flit being the reason of

deadlock in the buffer, drops its request and then looks for another flit whose request can be

granted to free some slots in the buffer and break the dependency.

Figure 3 shows an example how RAB works. In each input-port, the RAB-controller (RAB-

cntrl) manages the detection of deadlock and handles the assignment of wr-adr and rd-adr

addresses. The detection mechanism is based on a timer which after a period of time, if the

request being processed is not served, a flag is issued informing the presence of a deadlock (Fig.

3 (a)). This is done by reading the sw-grnt signal received from the Switch-allocator. In this case,

12

Page 13: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

the BC reads the head of the next packet in the buffer and checks whether the requested out-port is

different from the one previously flagged as blocked or not. When it finds a request whose channel

is free (Fig. 3 (b)), it sends a request to the Switch-allocator to be served. When the request is

granted, the flits of the granted packet are read from the buffer and the freed slots can be used to

host another incoming packet (Fig. 3 (c)). After new flits are written in the buffer, the blocked

packet is checked again (Fig. 3 (d)). The RAB-cntrl receives a grant for the direction requested

(North) and the packet is read from the buffer.

The timer’s value is one of the important choices that should be carefully taken. This is

because the deadlock occurrence is strongly dependent on the application for any adaptive routing

algorithm. Moreover, the presence of faults in the system adds a higher probability for the

deadlock occurrence; therefore, before selecting the value of the timer, we profiled each one of the

used applications and analyzed the task graph of each one of them. During this analysis we took

into consideration the fault-rate and its impact on the congestion and deadlock occurrence.

In order to avoid any flit overwriting caused by mismanaging the wr-adr and rd-adr addresses,

we added a small status register (labeled SR in Figs. 3 and 4) which keeps information about the

flits being flagged (i.e., not served yet). When a flit is read from the buffer, the BC checks the

status register and makes sure not to write the incoming flit in a flagged slot. Observing, Figs. 3

(1) and (2), we can see that status[0] and status[1], which are keeping information about the two

flits of the flagged packet are updated to 1. As shown in Fig. 3 (3), when an incoming flit arrives to

the buffer, RAB-cntrl makes sure that it will not be stored in one of these two flits’ locations, but

instead it is stored in an empty one. When a flit which was previously flagged, and after of period

of time its request is granted, the RAB-cntrl should update the status register to free the flagged

slot so it can be used by other incoming flits, as it is show in Fig. 3 (4) (status[0] is updated to 0).

With this simple technique, performance and deadlock-freedom are ensured while guarantee-

ing a small overhead. Moreover, instead of managing many requests at the same time, as it is the

13

Page 14: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

case of VCs requiring then additional complexity and delay for the arbitration, RAB mechanism

handles each request at a time. Despite the delay penalty required by the timer to detect the

deadlock, this technique is still faster and simpler to implement than Virtual-channels.

Livelock freedom As long as the chosen route is minimal, the livelock problem does not exist

either. However, it can be observed when a nonminimal direction is selected. For this reason,

some restrictions are added when selecting the nonminimal route [18]. The first restriction forbids

the flit to turn back to the same direction where it came from. The second one forbids selecting a

path which is in the opposite direction of the faulty link (i.e., if ”East” is faulty then ”West” should

not be selected). Adopting these restrictions guarantees the livelock freedom of HLAFT, and the

flits will continue to advance and search for a route until it finds a valid link.

Power management For the proposed system, we used a power management scheme to reduce

the dynamic power in the system. This scheme is based on clock-gating to turn-off the clock in

some specific components which, in some periods of time, they are in a waiting state and not

performing any computation. Therefore, it is important to switch off these components when they

are not necessary for the correct working of the system and they are awakened when the system

requires their services.

We employed the power management scheme in three main components. First, when a router

detects that one of its links is faulty, it immediately switches off the clock from the entire input-

port associated to this faulty link. In this case, an important amount of dynamic power can be

saved since input-ports occupy the largest portion of the system power budget due to the presence

of buffers which are considered as the hungriest components in the entire router in terms of power.

Second, we turn off the clock in the local-routing module that we added to recompute the incoming

route if necessary. Thus, the local-routing module is initially not fed by the clock. Whenever

the fault-control module signals the necessity to recompute the route, the local-routing module

14

Page 15: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

is awakened and performs the necessary computation. Otherwise, it is disconnected from the

clock and the unnecessary dynamic power is spared. Finally, the last component is the RAB-

manger circuit (See Fig. 4). As we previously mentioned, RAB deadlock-recovery technique is

triggered only at the presence of deadlock. When no blocking happens, the buffer is managed by

the conventional First-In-First-Out (FIFO) buffer manager, as shown in Fig. 4. For dynamic-power

saving, RAB-manager is put asleep at the absence of deadlock (deadlock-flag is equal to zero) and

it is working only when the timer (deadlock-flag) informs it that deadlock is occurring. In this

fashion, deadlock-recovery is insured at a low-cost with no unnecessary power waste.

5 Evaluation

5.1 Evaluation methodology

Our proposed algorithm is implemented on 3D-OASIS-NoC system [20, 21, 22] which was

designed in Verilog HDL. For the hardware synthesis we used Altera Quartus II CAD tool, and

Power Play Analyzer to extract both static and dynamic power of the proposed system [48]. In our

evaluation, we conducted two experiments: in the first one, we evaluated the hardware complexity

and performance of 3D-OASIS-NoC and compared it to that of the baseline 2D-OASIS-NoC

system. In the second experiment, we evaluated the hardware complexity of HLAFT router in

terms of area utilization, power consumption (static and dynamic) and speed. To evaluate the

performance of the proposed algorithm, we selected Matrix-multiplication [49] and JPEG-encoder

[50] as real benchmarks and also two traffic patterns: Transpose [51] and Uniform [52]. We chose

Matrix-multiplication because it is one of the most fundamental problems in computer sciences

and mathematics, which forms the core of many important algorithms such as engineering and

image processing applications [49]. To evaluate 3D-OASIS-NoC system’s performance with

Matrix-multiplication, we set the matrix size to a 6×6. We also decided to calculate from 1 to

15

Page 16: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

100 different matrices at the same time. This aims to increase the number of flits traveling the

network at the same time and see the impact of congestion on the performance of the proposed

system with different traffic loads.

The second application used for the evaluation is the JPEG-encoder application [50] which is a

well-known application frequently used in a lot of research and includes some parallel tasks. This

can be suitable to evaluate the performance of NoC systems. To increase the network size and

introduce more parallelism, we implemented four JPEG systems that run concurrently and whose

tasks share three shared memories where the input and output images are stored.

The Transpose traffic pattern is a communication method based matrix transposition. Each

node sends messages to another node with the address of the reversed dimension index [51]. The

Transpose workload is often used to evaluate the NoC throughput and power consumption since

it creates a bottleneck due to the long communication distance exhibited between (transmitter and

receiver) pairs.

The Uniform traffic pattern is a standard benchmark used in on-chip and off-chip network

routing studies which can be considered as the traffic model for well-balanced shared memory

computations [52]. Each node sends messages to other nodes with an equal probability (i.e.,

destination nodes are chosen randomly using a uniform probability distribution function).

In our evaluation with the two traffic patterns, we set 4x4x4 as a network size where all the

nodes were assigned for both transmitter and receiver nodes. Each transmitter node injects from

102 to 105 flits into the network. While on the other side, receiver nodes verify the correctness of

the received flits.

Using these four benchmarks, we conducted two experiments: in the first one we evaluated

the performance of 3D-OASIS-NoC when compared to the base 2D-architecture under different

routing algorithms (XY [53], XYZ [55], LA-XY [54], and LA-XYZ [22]). In the second

experience, we evaluated the latency per flit, the throughput, and the dynamic-power of the

16

Page 17: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

proposed Hybrid-Look-Ahead-Fault-Tolerant routing algorithm under each of the aforementioned

application. We observed the performance variation of the proposed system under different fault

link rates: 0%, 1%, 5%, 10%, 15% and 20%. The number of links in each system can be calculated

using the formula shown below [56]:

#links = N1N2(N3 − 1) + N1N3(N2 − 1) + N2N3(N1 − 1) (1)

Where N1, N2 and N3 are the respective network’s X, Y and Z dimensions. During the evaluation,

we divided the faults into two categories: Half of the faults are permanent (considered during

the whole simulation time) and the second half are transient (randomly start and end along the

simulation time). In addition, as much as the fault-rate increases, we employed more faults in flits’

paths to cause nonminimal routing and observe the system behavior in a worst case environment.

All the results obtained with HLAFT were compared with LAFT [18], LA-XYZ [22], and also

Dimension-Order-Routing XYZ [53]. Table 1 represents the configuration parameters used for

our evaluation.

5.2 3D Vs. 2D NoC architectures comparison results

5.2.1 Hardware complexity evaluation

Table 2 summarizes the hardware complexity of both 2D- and 3D-OASIS-NoC architectures based

on XY, LA-XY (2D), XYZ, and LA-XYZ (3D). When we extend the 2D-architecture to the third

dimension (XY to XYZ or LA-XY to LA-XYZ) we observe, an area penalty which was about

37% in average. This can be explained by the additional two input-ports used by inter-layer links

to connect the two adjacent layers. This results in additional buffer resources usage in addition

to a larger crossbar. We also evaluated the effect of look-ahead routing on both architectures.

Adopting look-ahead incurs an additional 4.6% extra area in average. This additional area can be

explained by the additional bits needed for the Next-port identifier. In terms of speed, LA-XYZ-

17

Page 18: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

based system has the slowest speed which is caused by the additional hardware that we previously

explained. In terms of total power consumption, 3D-architectures show the best performance

illustrated in an average reduction of 5.2%. This slight reduction is a result of a tradeoff between

the additional static power caused by the additional hardware when moving to third dimension,

and the dynamic power reduction thanks to reduced number of hops with 3D-architectures which

incurs less buffering, less arbitration, and less computation resulting in dynamic-consumption

reduction.

5.2.2 Communication latency evaluation

We evaluated the communication latency by calculating the latency per flit for each architecture

under each of the four applications. The latency per flit results are shown in Fig. 5. Extending the

system to the third dimension in Matrix application (Fig. 5 (a)) reduces the latency/flit to 33% in

average thanks to the reduced number of hops. Furthermore, adopting look-ahead routing (LA-XY

and LA-XYZ) offers 25% latency/flit reduction. This reduction is obtained thanks to the ability of

look-ahead routing to reduce the router delay and forward faster the flits to reach its destination.

As a result, when compared to XY-based system, LA-XYZ-based one reduces the latency/ flit up

to 55%. As a conclusion, 3D-architectures reduce the latency/flit with 22.5%, 24.5%, and 26%

with Transpose, Uniform, and JPEG applications, respectively, when compared to 2D-routing-

based systems. On the other hand, employing look-ahead routing offers also 28%, 29%, and 17%

latency reduction with Transpose, Uniform, and JPEG applications, respectively.

5.2.3 Throughput Evaluation

For the second 2D- and 3D-architecture evaluation, we calculated each system’s throughput.

Figures 6 (a), (b), (c), and (d) show the throughput results for each of the used application.

Transpose application (Fig. 6 (b)) shows the best throughput enhancement, when comparing 3D-

18

Page 19: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

to 2D-architectures, illustrated in 45% improvement. This is can be explained by the long-distance

communication omnipresent in this kind of traffic pattern. With this kind of communication, 3D-

architectures take advantage of the reduced number of hops to forward flits to their destination

faster than 2D-systems. When observing the look-ahead routing effect, we noticed that Matrix-

multiplication (Fig. 6 (a)) showed the best performance represented in 32% enhancement when

comparing LA-XYZ to Dimension-Order-Routing XYZ. This improvement can be explained by

the large size of the network (3×6×6) presenting long distance between source and destination

nodes, where the router delay plays an important role in the throughput.

5.3 3D routing algorithms comparison results

The second part of the evaluation takes into consideration fault-tolerance and compares the

proposed Hybrid-Look-Ahead-Fault-Tolerant routing algorithm (HLAFT) with other 3D-routings

(XYZ, LA-XYZ, and LAFT). This comparison is made in terms of hardware complexity (area,

speed, and static-power), communication latency, throughput, dynamic-power, and reliability.

5.3.1 Hardware complexity evaluation

Table 3 represents the hardware complexity results of 3D-OASIS-NoC’s router when employing

HLAFT, XYZ, LA-XYZ, and LAFT. To see the additional hardware required for the Random-

Access-Buffer (RAB) for deadlock freedom, we evaluated the proposed HLAFT’s router without

(HLAFT-based) and with RAB (HLAFT+RAB-based). When compared to XYZ and LAFT,

HALFT presents an increasing area evaluated to 23% and 34%, respectively. This area penalty

is a natural cause to the additional Loc-routing module added to recompute Next-port if needed.

Moreover, the additional control signals added to perform the flag decision and insure the correct

functionality of the proposed routing algorithm. Observing the additional area required for

the adopted deadlock-recovery technique, RAB (HLAFT+RAB-based) requires only less than

19

Page 20: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

7.1% of extra hardware when compared to HLAFT router with no deadlock-recovery mechanism

(HLAFT+RAB-based). In terms of static-power consumption, the addition of RAB does not

considerably affect the proposed algorithm’s power whose static-power consumption increased

by 2.8% and 5.6% when compared to LAFT and XYZ, respectively. In terms of speed, XYZ-

based router seems to be the fastest among the three other routers thanks to its low area while

HLAFT is 2.2% and 10.3% slower than LAFT- and XYZ-based routers, respectively.

5.3.2 Communication latency evaluation

For the performance evaluation, we first calculated the latency/flit for XYZ and LA-XYZ with

0% fault-rate (since both algorithms do not support fault-tolerance), and LAFT and HLAFT under

fault-rates varying between 0% and 20% and we observed the performance variation. Figures 7

(a), (b), (c), and (d) show the average latency/flit results for each of the used applications. At the

absence of faults, both LAFT and HLAFT have the same latency/flit. This means that the Local-

routing was not triggered and HLAFT has the same behavior as LAFT. In Uniform application

(Fig. 7 (c)) the latency/flit reduction with HLAFT and LAFT can reach the 48% and 33% when

compared to XYZ and LA-XYZ, respectively. This reduction is a result of the ability of both

algorithms to detect the congestion caused by such traffic and makes the best routing decision

for less congested channels. When we increased the fault-rate and placed the fault links in some

critical locations, the previous LAFT algorithm fails and deadlock happens at 20% (Matrix and

Transpose) and starting from 10% (Uniform and JPEG) fault-rates as illustrated in Figs. 7 (a), (b),

(c), and (d). In these figures, we can see that the latency tends to∞, which means the failure of the

system. On the other hand, HLAFT manages to avoid deadlock and, moreover, its latency is still

smaller than XYZ even under 20% fault-rate with JPEG application (Fig. 7 (d)) while observing

a small overhead when compared to LA-XYZ illustrated in 1.5% in Transpose traffic. In average

20

Page 21: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

and under 20% fault-rate, the latency/flit is increased with HLAFT with 13.75% and 3.2% when

compared to LA-XYZ and XYZ, respectively. It’s important to remind that despite their small

outperformance against the proposed HLAFT algorithm, both LA-XYZ and XYZ algorithms are

static schemes that do not support fault-tolerance and they can crash even at the presence of a

single faulty link.

When we reobserve the latency/flit results, we can see that even when LAFT succeeds to

avoid deadlock, its latency is always equal or higher than that of HLAFT. The outperformance

of HLAFT against LAFT can reach the 6.2 and 12.6% latency/flit average reduction under 5%

and 10% fault-rates, respectively, with Matrix application. This performance improvement is

obtained thanks to the employment of both local- and look-ahead routing which gives our system

the opportunity to recompute the route if it finds out that the already calculated one will lead to

a blocking path costing several clock cycles for nonminimal routing. In this fashion, HLAFT

optimizes the routing decision and minimizes the probability of finding a blocking path (that leads

to a nonminimal route) while keeping the routing efficiently minimal as long as one minimal path

exists. Thus, with the help of RAB, both fault-tolerance and deadlock-recovery are insured with

no considerable performance degradation.

5.3.3 Throughput Evaluation

In the second performance evaluation, we computed the throughput of each routing-based system

as shown in Figs. 8 (a), (b), (c), and (d). Under 0% fault-rate, the throughput enhancement

with LAFT and HLAFT can reach the 53% and 23% when compared to XYZ and LA-XYZ,

respectively, with Matrix application (Fig. 8 (a)). Under 5% and 10% fault-rates, HLAFT

keeps its advantage against XYZ for all applications, and even under 20% fault-rates, HLAFT

still outperforms XYZ with Matrix, Transpose and JPEG application observing 12% throughput

21

Page 22: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

degradation with Uniform traffic-pattern. When compared to LA-XYZ, HALFT’s throughput is

the highest under 5% and 10% fault-rates for all applications except for Uniform (10%) where we

observed a slight 2.3% degradation. The final throughput comparison considered the performance

of HLAFT and LAFT. As we previously mentioned, LAFT fails to avoid deadlock when the fault-

rate exceeds the 10%. For this reason, the throughput results of LAFT are not shown on these

figures and the throughput is considered equal to zero. However, HLAFT with the help of the

RAB deadlock-recovery technique, manages to detect and eliminate the deadlock. Moreover,

even when LAFT succeeds to avoid deadlock, its throughput is always equal or lower than that of

HLAFT. The enhancement of HLAFT against LAFT can reach the 5.4% and 11.8% under 5% and

10% fault-rates, respectively, with Transpose application.

5.3.4 Dynamic-power evaluation

In the third part of our evaluation, we evaluated the dynamic-power of each routing-based system.

In addition, we want to see the effect of the employed Power-manager (PM) for dynamic-power

reduction; therefore, we evaluated the proposed HLAFT algorithm without (HLAFT) and with

(HALFT+PM) PM. The average dynamic-power results are illustrated in Table 4. XYZ-based

system represents the one with the lowest dynamic-power consumption followed by LA-XYZ with

a slight 2.3% difference. The low dynamic-power of these two deterministic routing-based systems

is natural due to their simplicity and the absence of fault-tolerance. When evaluating LAFT, it

exhibits 7.2% and 5.1% higher average dynamic-power consumption than XYZ and LA-XYZ,

respectively, which increases as much as we increase the fault-rates where the computation is more

important than those at low fault-rates. When comparing LAFT with HLAFT, we can observe that

the proposed algorithm, without employing any power management, presents the highest dynamic-

power consumption illustrated by an average of 5% which also increases also linearly as we

22

Page 23: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

increase the fault-rate. This can be explained by the power overhead caused by both Local-routing

(to recompute the route) and RAB (for deadlock-recovery) which are triggered more frequently

at high fault-rates (10% and 20%). When adopting the Power-management (PM) for HLAFT, the

dynamic-power reduction with the optimized system can reach the 21.2% when compared to the

system with no PM. Furthermore, when no faults are detected in the system, HLAFT+PM has

almost the same dynamic-power consumption as in LAFT since both Local-routing module and

RAB are shutdown. As we employ faults, HLAFT+PM starts to take advantage against LAFT

that can reach the 7.2% at 20% fault-rate. At this high fault-rate, the computation performed

in HLAFT+PM is more important than that in LAFT, causing then additional dynamic-power

that can be seen in HLAFT without PM. However, when integrating the PM module 20% of the

input-port connected to the faulty-links are clock-gated providing an important dynamic-power

saving. Despite the small area penalty, the proposed system endorses fault-tolerance, provides

higher performance, and ensures deadlock-freedom. Moreover, all these benefits are obtained

with a minimal dynamic-power overhead to maintain the system fault-tolerant at low cost.

5.3.5 Reliability evaluation

In this subsection, we discuss the reliability of HLAFT. We define reliability as the capability of

the system to deliver all the packets to their destinations. If all packets are delivered except for

one, the system is considered as unreliable. We compared the results with both AFRA [46] and

HamFA [47] algorithms. The reliability results of these schemes are obtained from [47] where

Uniform traffic pattern was used and one, two, and three faulty links are considered.

HamFA could tolerate one, two, and three faulty links by 95%, 44%, 20% reliability,

respectively, while AFRA could tolerate 33%, 7%, and 3% for the same number of faulty

links, respectively [47]. For the proposed HLAFT, the system could deliver all the flits to their

destinations with no single dropped flit (100% reliability) when considering three faulty links.

23

Page 24: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

Moreover, all the packets could reach their destinations when running all the used benchmarks,

even when considering a large number of faulty links that can reach sometimes 252 faulty links

(case of 20% fault rate in Matrix-multiplication application).

We can explain this reliability by the fact that HLAFT always finds a path form source to

destination, no matter where the fault is located and how heavy the traffic is. The only possible

reason that can prevent the arrival of a given packet to its destination is the presence of deadlock.

But thanks to the adopted RAB mechanism, deadlock was avoided and, therefore, all packets could

reach their destinations providing a full reliability under different fault-rates and with different

injection-rates.

6 Conclusion

In this paper, we presented a novel low-latency, high-throughput, and fault-tolerant 3D-Network-

on-Chip routing algorithm named Hybrid-Look-Ahead-Fault-Tolerant (HLAFT). HLAFT is a

minimal routing algorithm that combines the benefits of look-ahead routing and also local-routing

for better routing choices. Furthermore, HLAFT adopts a low-cost technique, named Random-

Access-Buffer (RAB), for deadlock-recovery. We implemented the proposed algorithm on a

real 3D-NoC architecture (3D-OASIS-NoC), which has shown good performance, scalability and

robustness. We prototyped HLAFT on an FPGA, evaluated its performance over real and large

benchmarks, and compared it with well-known 3D-NoC routing schemes.

We conducted two experiments where in the first one we showed the benefits of adopting

3D-architecture over the 2D-one illustrated in 33% latency/flit reduction and 45% throughput

enhancement. We also saw the outperformance of look-ahead routing against local-routing

depicted in an average of 24.7% latency/flit reduction and a throughput enhancement that can

reach the 33%.

In the second experiment we demonstrated that the proposed HLAFT algorithm introduces

24

Page 25: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

lower latency/flit reduction against Dimensional-Order-Routing (XYZ) even at 20% fault-rate.

Thanks to the adopted RAB deadlock-recovery technique, deadlock-freedom was ensured in

HLAFT while the previous algorithm (LAFT) fails at 10% fault-rate. HLAFT outperforms

LAFT even when deadlock does not occur in this latter. This outperformance is represented in

a latency/flit reduction that can reach the 12.5% and a throughput enhancement reaching 11.8%.

Thanks to the adopted Power-management scheme (PM), the dynamic-power increasing could be

controlled and an important saving could be obtained that reached 7.2% when compared to the

earlier LAFT routing.

Adopting the proposed HLAFT routing introduced an area penalty represented in 23% and

34% extra ALUTs when compared to XYZ and LAFT, respectively, while observing that the

adopted RAB deadlock-recovery technique cost only 7.1% extra hardware. In terms of static-

power consumption a small 2.8% and 5.6% overhead is observed when compared to LAFT and

XYZ, respectively, and also 2.2% and 10.3% decreasing speed was reported when compared to

the aforementioned algorithms, respectively.

As a future work, an in-depth thermal power study should be conducted to observe how the

performance gain obtained with the proposed algorithm would affect this design requirement,

as it is very crucial for 3D-Network-on-Chip architectures. Moreover, we want to consider the

occurrence of faults in other components of the systems, such as input buffers and crossbar. These

circuits consume a big portion of the system area, making them more vulnerable to transient and

permanent faults.

References

[1] Y. Xie, J. Cong, and S. Sapatnekar. Three-Dimensional Integrated Circuits Design.

Springer, 2010.

25

Page 26: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

[2] S. Rusu, S. Tam, H. Muljono, D. Ayers, J. Chang, R. Varada, M. Ratta, and S. Vora. A 45

nm 8-core enterprise Xeon processor. In Proc. IEEE Asian Solid-State Circuits Conf., Nov.

2009, pages 9-12.

[3] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L.

Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger,

N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook. TILE64 processor: A 64-

core SoC with mesh interconnect. In Proc. of the IEEE International Solid-State Circuits

Conference. Digest of Technical Papers, Feb. 2008, pages 88-598.

[4] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T.

Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar. An 80-tile

sub-100-w teraFLOPS processor in 65-nm CMOS. IEEE Journal on Solid-State Circuits,

43(1):29-41, Jan. 2008.

[5] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks, Morgan

Kaufmann, 2004.

[6] A. Ben Abdallah and M. Sowa. Basic Network-on-Chip Interconnection for Future

Gigascale MCSoCs Applications: Communication and Computation Orthogonalization.

Proceedings of The TJASSST2006 Symposium on Science, December 2006.

[7] G. Philip, B. Christopher, and P. Ramm. Handbook of 3D Integration: Technology and

Applications of 3D Integrated Circuits. Wiley-VCH, 2008.

[8] A. W. Topol, J. D. C. La Tulipe, L. Shi, D. J. Frank, K. Bernstein, S. E. Steen, A. Kumar,

G. U. Singco, A. M. Young, K. W. Guarini and M. Ieong. Three-dimensional Integrated

Circuits. IBM Journal of Research and Development, 50(4/5):491-506, July 2006.

26

Page 27: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

[9] G. Sun, X. Dong, Y. Xie, J. Li and Y. Chen. A Novel 3D Stacked MRAM Cache

Architecture for CMPs. IEEE 15th International Symposium High Performance Computer

Architecture, pages 239-249, February 2009.

[10] L. Benini and G. De Micheli. Networks on Chips: Technology and Tools. Morgan

Kauffmann, 2006.

[11] I. Loi, F. Angiolini, S. Fujita, S. Mitra, L. Benini. Characterization and Implementation

of Fault-Tolerant Vertical Links for 3-D Networks-on- Chip. IEEE Trans. On CAD of

Integrated Circuits and Systems, 30(1):124-134, Jan. 2011.

[12] A. DeOrio, D. Fick, V. Bertacco, D. Sylvester, D. Blaauw, J. Hu, and G. Chen. A Reliable

Routing Architecture and Algorithm for NoCs. IEEE Trans. on CAD of Integrated Circuits

and Systems, 31(5):726-739, May 2012.

[13] A. Burns and A. Wellings. Real-Time Systems and Programming Languages Ada Real-Time

Java and C/Real-Time Posix. Addison Wesley, 2009.

[14] P. P. Pande, H. Zhu, A. Ganguly, and C. Grecu. Energy reduction through crosstalk

avoidance coding in NoC paradigm. In Proc. of the 9th EUROMICRO Conference on

Digital System Design, pages 689-695, 2006. IEEE Computer Society.

[15] S. Lin and D. J. Costello, Error Control Coding: Fundamentals and Applications.

Englewood Cliffs, NJ: Prentice-Hall, 1983.

[16] W. J. Dally. Virtual-channel flow control, IEEE Trans. on Parallel and Distributed Systems,

3(2):194-205, March 1992.

[17] Y. Tar and G. L. Frazier. High-performance multiqueue buffers for VLSI communication

switches. 15th Annual International Symposium on Computer Architecture, pages 343-354,

May-June 1988.

27

Page 28: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

[18] A. Ben Ahmed and A. Ben Abdallah. Architecture and Design of High-throughput, Low-

latency, and Fault-Tolerant Routing Algorithm for 3D-Network-on-Chip (3D-NoC). The

Journal of Supercomputing, 66(3):1507-1532, December 2013.

[19] A. Ben Ahmed and A. Ben Abdallah. Fault-tolerant Routing Algorithm with Deadlock

Recovery Support for 3D-NoC Architectures. The 7th IEEE International Symposium on

Embedded Multicore SoCs, pages 67-72, September 2013.

[20] A. Ben Ahmed, A. Ben Abdallah, and K. Kuroda. Architecture and Design of Efficient

3D Network-on-Chip (3D NoC) for Custom Multicore SoC. IEEE Proceedings of the

5th International Conference on Broadband, Wireless Computing, Communication and

Applications, pages 67-73, November 2010.

[21] A. Ben Ahmed and A. Ben Abdallah. LA-XYZ: Low Latency, High Throughput Look-

Ahead Routing Algorithm for 3D Network-on-Chip (3D-NoC) Architecture. The 6th IEEE

International Symposium on Embedded Multicore SoCs, pages 167-174, September 2012.

[22] A. Ben Ahmed and A. Ben Abdallah. Low-overhead Routing Algorithm for 3D Network-

on-Chip. IEEE Proceedings of The Third International Conference on Networking and

Computing, pages 23-32, December 2012.

[23] K. Mori, A. Esch, A. Ben Abdallah and K. Kuroda. Advanced Design Issues for OASIS

Network-on-Chip Architecture. IEEE Proceedings of the 5th International Conference

on Broadband, Wireless Computing, Communication and Applications, pages 74-79,

November 2010.

[24] W. J. Dally, L. R. Dennison, D. Harris, K. Kan, and T. Xanthopoulos. The reliable router: A

reliable and high-performance communication substrate for parallel computers. In Proc. of

28

Page 29: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

the First International Workshop on Parallel Computer Routing and Communication, pages

241-255, 1994.

[25] J. Duato, A theory of fault-tolerant routing in wormhole networks. IEEE Trans. Parallel

Distributed Systems, 8(8):790-802, August 1997.

[26] Z. Zhang, A. Greiner, and S. Taktak. A reconfigurable routing algorithm for fault-

tolerant 2-D-mesh network-on-chip. In Proc. of the 45th ACM/IEEE Conference on Design

Automation Conference, pages 441-446, June 2008.

[27] P.-H. Sui and S.-D. Wang. Fault-tolerant wormhole routing algorithms for mesh networks.

IEEE Trans. on Computers, 44(7):848-864, January 2000.

[28] C. Liu, L. Zhang, Y. Han, and X. Li. A resilient on-chip router design through data path

salvaging. In Proc. of the 16th Asia and South Pacific Design Automation Conference, pages

437-442, January 2011.

[29] S. Rodrigo, J. Flich, A. Roca, S. Medardoni, D. Bertozzi, J. Camacho, F. Silla, and J.

Duato. Addressing manufacturing challenges with cost-efficient fault tolerant routing. 4th

ACM/IEEE International Symposium on Networks-on-Chip, pages 25-32, May 2010.

[30] P. Bogdan, T. Dumitras, and R. Marculescu. Stochastic communication: A new paradigm

for fault-tolerant networks-on-chip. VLSI Design, 2007(95348):17, 2007.

[31] M. Ebrahimi, M. Daneshtalab, J. Plosila, and H. Tenhunen. MAFA: Adaptive Fault-Tolerant

Routing Algorithm for Networks-on-Chip. In Proc. of 15th Euromicro Conference on

Digital System Design, pages 201-207, September 2012.

[32] M. Ebrahimi, M. Daneshtalab, J. Plosila, and F. Mehdipour, MD: Minimal path-based Fault-

Tolerant Routing in On-Chip Networks. In Proc. of 18th Asia and South Pacific Design

Automation Conference, pages 35-40, January 2013.

29

Page 30: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

[33] M. Ebrahimi. Reliable and Adaptive Routing Algorithms for 2D and 3D Networks-on-Chip.

Routing Algorithms in Networks-on-Chip, Springer, 2014.

[34] I. Loi, S. Mitra, T. H. Lee, Sh. Fujita, and L. Benini. A low-overhead fault tolerance scheme

for TSV-based 3D network on chip links. In Proc. of the 2008 IEEE/ACM International

Conference on Computer-Aided Design, pages 598-602, 2008.

[35] S. Pasricha. Exploring serial vertical interconnects for 3D ICs. In Proc. of the 46th

ACM/IEEE Design Automation Conference, pages 581-586, July 2009.

[36] M. Nicolaidis. Design for soft error mitigation. IEEE Trans. on Device and Materials

Reliability, 5(3):405-418, September 2005.

[37] F. L. Kastensmidt, L. Carro, and R. Reis. Fault-Tolerance Techniques for SRAM-Based

FPGAs. Frontiers in Electronic Testing, Springer-Verlag, 2006.

[38] A. -M. Rahmani, K. R. Vaddina, K. Latif, P. Liljeberg, J. Plosila and H. Tenhunen. Design

and Management of High-performance, Reliable and Thermal-aware 3D Networks-on-

Chip. IET Circuits, Devices & Systems, 6(5):308-321, September 2012.

[39] N.A. Nordbotten, M.E. Gmez, J. Flich, P. Lopez, A. Robles, T. Skeie, O. Lysne, and

J. Duato. A Fully Adaptive Fault-Tolerant Routing Methodology Based on Intermediate

Nodes. In Proc. of IFIP international conference on Network and Parallel Computing,

pages 341-356, October 2004.

[40] Y. Li, Sh. Peng, and W. Chu. Adaptive box-based efficient fault-tolerant routing in 3D torus.

In Proc. of the 11th International Conference Parallel and Distributed Systems, pages 71-

77, July 2005.

30

Page 31: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

[41] Ch. Feng, M. Zhang, J. Li, J. Jiang, Z. Lu and A. Jantsch. A Low-Overhead Fault-Aware

Deflection Routing Algorithm for 3D Network-on-Chip. IEEE Computer Society Annual

Symposium on VLSI, pages 19-24, July 2011.

[42] A. A. Chien and J. H. Kim. Planar-adaptive Routing: Low-cost Adaptive Networks for

Multiprocessors. The 19th Annual International Symposium on Computer Architecture,

pages 268-277, 1992.

[43] Z. Jiang, J. Wu and D. Wang. A New Fault Information Model for Fault-Tolerant Adaptive

and Minimal Routing in 3-D Meshes. IEEE Transactions on Reliability, 57(1):149-162,

March 2008.

[44] D. Xiang, Y. Zhang and Y. Pan. Practical Deadlock-Free Fault-Tolerant Routing Based on

the Planar Network Fault Model. IEEE Transactions on Computers, 58(5):620-633, May

2009.

[45] S. Pasricha and Y. Zou. A Low Overhead Fault Tolerant Routing Scheme for 3D Networks-

on-Chip. The 12th International Symposium on Quality Electronic Design, pages 1-8,

March 2011.

[46] S. Akbari, A. Shafieey, M. Fathy and R. Berangi. AFRA: A Low Cost High Performance

Reliable Routing for 3D Mesh NoCs. Design, Automation & Test in Europe Conference &

Exhibition, pages 332-337, March 2012.

[47] M. Ebrahimi, M. Daneshtalab, and J. Plosila. Fault-Tolerant Routing Algorithm for 3D

NoC Using Hamiltonian Path Strategy. In Proc. of Design, Automation & Test in Europe

Conference & Exhibition (DATE), pages 1601-1604, March 2013.

[48] http://www.altera.com

31

Page 32: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

[49] P. Chan, K. Dai, D. Wu, J. Rao and X Zou. The Parallel Algorithm Implementation of

Matrix Multiplication Based on ESCA. IEEE ASIA Pacific Conference on Circuits and

Systems, pages 1091-1094, December 2010.

[50] Y. L. Lee, J. W. Yang, and J. M. Jou. Design of a distributed JPEG encoder on a scalable

NoC platform. IEEE International Symposium VLSI-DAT, pages 132-135, April 2008

[51] A. A. Chien and J. H. Kim. Planar-Adaptive Routing: Low-Cost Adaptive Networks for

Multiprocessors. Journal of the ACM, 42(1):91-123, January 1995.

[52] R. Sivaram. Queuing delays for uniform and nonuniform traffic patterns in a MIN. ACM

SIGSIM Simulation Digest, 22(1):17-27, 1990.

[53] H. Sullivan and T. R. Bashkow. Large Scale, Homogeneous, Fully Distributed Parallel

Machine. Annual Symposium on Computer Architecture, ACM Press, pages 105-117,

March 1977.

[54] A. Ben Ahmed, K. Mori, and A. Ben Abdallah. ONoC-SPL: Customized Network-on-Chip

(NoC) Architecture and Prototyping for Data-intensive Computation Applications. IEEE

Proc. of the 4th International Conference on Awareness Science and Technology, pages

257-262, August 2012.

[55] C. H. Chao, K. Y. Jheng, H. Y. Wang, J. C. Wu and A. -Y. Wu. Traffic and Thermal-aware

Run-time Thermal Management Scheme for 3D NoC Systems. In Proc. of the ACM/IEEE

International Symposium on Networks-on-Chip (NoCS), pages 223-230, May 2010.

[56] B. Feero and P. P. Pande. Performance Evaluation for Three-Dimensional Networks-on-

Chip. Proceedings of IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages

305-310, May 2007.

32

Page 33: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

Figure 1: Example of fault-tolerant routing: (a) Look-ahead routing (LAFT) (b) Hybrid routing

(HLAFT).

Figure 2: Router pipeline stages: (a) Look-Ahead-Fault-Tolerant [18] (b) Hybrid-Look-Ahead-

Fault-Tolerant.

33

Page 34: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

(a) (b)

(c) (d)

Figure 3: Example of deadlock-recovery with Random-Access-Buffer.

Figure 4: Input buffer block diagram.

34

Page 35: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

(a) (b)

(c) (d)

Figure 5: Latency per flit comparison results between 2-D and 3-D NoC architectures with: (a)

Transpose; (b) Uniform; (c) 6 × 6 Matrix; (d) JPEG.

35

Page 36: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

Algorithm 1: Hybrid-Look-Ahead-Fault-Tolerant routing algorithm (HLAFT)

// Destination address

Input: Xdest , Ydest , Zdest

// Current node address

Input: Xcur , Ycur , Zcur

// Next-port identifier

Input: Next-port

// Link status information

Input: Fault-in

// New-next-port for next node

Output: New-next-port

// Calculate the next-node address

Next← Next-node (Xcur , Ycur , Zcur , Next-port);

// Read fault information for the next-node

Next-fault← Next-status (Fault-in, Next-port);

// Calculate the three possible directions for the next-node

Next-dir← poss-dir (Xdest , Ydest , Zdest , Nextx, Nexty, Nextz);

// Evaluate the flag

if (|Next-dir| == 0) thenflag← 1;

else flag← 0;;

// Evaluate the New-next-port

if (flag == 1) then

// Trigger local-routing

out-port← local-routing (Fault-in, Xcur , Ycur , Zcur , Xdest , Ydest , Zdest);

// Execute LAFT with the new recomputed out-port

New-next-port← LAFT-routing (Xcur , Ycur , Zcur , Xdest , Ydest , Zdest , out-port, Fault-in);

else

// Execute LAFT with the inputed Next-port identifier

New-next-port← LAFT-routing (Xcur , Ycur , Zcur , Xdest , Ydest , Zdest , Next-port, Fault-in);

end

36

Page 37: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

Table 1: Simulation configuration.

Parameters / System XYZ-based LA-XYZ-based LAFT-based HLAFT-based

Network SizeJPEG 3x3x3 3x3x3 3x3x3 3x3x3

(Mesh)Matrix 3x6x6 3x6x6 3x6x6 3x6x6

Transpose & Uniform 4x4x4 4x4x4 4x4x4 4x4x4

Flit size

JPEG 27 bits 30 bits 30 bits 30 bits

Matrix 31 bits 34 bits 34 bits 34 bits

Trans. & Unif. 34 flit 34 flit 34 flit 34 flit

Header size

JPEG 10 bits 13 bits 13 bits 13 bits

Matrix 10 bits 13 bits 13 bits 13 bits

Transpose 16 bits 13 bits 13 bits 13 bits

Payload size

JPEG 16 bits 16 bits 16 bits 16 bits

Matrix 21 bits 21 bits 21 bits 21 bits

Transpose 21 bits 21 bits 21 bits 21 bits

Buffer Depth 4 4 4 4

Switching Wormhole-like Wormhole-like Wormhole-like Wormhole-like

Flow control Stall-Go Stall-Go Stall-Go Stall-Go

Scheduling Matrix-Arbiter Matrix-Arbiter Matrix-Arbiter Matrix-Arbiter

Routing XYZ LA-XYZ LAFT HLAFT

Target FPGA device Stratix III Stratix III Stratix III

37

Page 38: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

Table 2: Router hardware complexity evaluation comparison results between 2D and 3D NoC

architectures.

System / Parameter Area Total Power Speed

(ALUTs) Mw Mhz

2D (XY-based) 78201 1399.08 125.32

2D (LA-XY-based) 90576 1211.06 109.22

3D (XYZ-based) 138578 1210.77 107.34

3D (LA-XYZ-based) 145287 1121.89 100.89

(a) (b)

(c) (d)

Figure 6: Throughput comparison results between 2-D and 3-D NoC architectures with: (a)

Transpose; (b) Uniform; (c) 6 × 6 Matrix; (d) JPEG.

38

Page 39: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

Table 3: Router hardware complexity comparison results between 3D-NoC systems.

System/ Parameter Area Static Power Speed

(mW) (MHz)

XYZ-based 2809 1258.01 194.61

LA-XYZ-based 3134 1269.76 194.09

LAFT-based 3272 1296.92 178.51

HLAFT-based 4277 1334.5 174.68

HLAFT+RAB-based 4580 1336.4 174.42

(a) (b)

(c) (d)

Figure 7: Latency per flit comparison results between XYZ, LA-XYZ, LAFT, and HLAFT based

3D-NoC systems with: (a) Transpose; (b) Uniform; (c) 6 × 6 Matrix; (d) JPEG.

39

Page 40: Graceful Deadlock-Free Fault-Tolerant Routing Algorithm

Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)

(a) (b)

(c) (d)

Figure 8: Throughput comparison results between XYZ, LA-XYZ, LAFT, and HLAFT based

3D-NoC systems with: (a) Transpose; (b) Uniform; (c) 6 × 6 Matrix; (d) JPEG.

Table 4: Average dynamic-power evaluation under different fault-rates.

System/ Fault-rate 0% 5% 10% 20%

XYZ-based 165.2 165.2 165.2 165.2

LA-XYZ-based 169.9 169.9 169.9 169.9

LAFT-based 170.09 172.92 178.51 194.61

HLAFT-based 173.7 174.61 183.68 231.43

HLAFT+PM-based 170.11 170.2 174.42 180.02

40