
ACO-BASED FAULT-AWARE ROUTING ALGORITHM FOR NETWORK-ON-CHIP SYSTEMS

Chia-An Lin, Hsien-Kai Hsin, En-Jui Chang, and An-Yeu (Andy) Wu

Graduate Institute of Electronics Engineering, National Taiwan University

No. 1, Sec. 4, Roosevelt Road, Taipei, 10617 Taiwan (R.O.C) Email: [email protected]

ABSTRACT

With the shrinking size of circuits and the scaling of Network-on-Chip (NoC), on-chip components have a higher chance of failing. On-chip failures can cause traffic congestion and even system crashes. To overcome this problem, the NoC routing algorithm should be implemented with fault-tolerant capability. Inspired by the fault-tolerant behavior of an ant colony, which consists of three steps (Encounter, Search, and Select), we propose the Ant Colony Optimization-based Fault-aware Routing (ACO-FAR) algorithm for traffic balancing. To effectively forward packets to a non-faulty region, we propose three mechanisms in ACO-FAR that correspond to the three-step behavior of ants. The simulation results show that the proposed ACO-FAR achieves 12.5%-77.7% higher saturation throughput than related works. In addition, this routing method improves the reachable packet ratio to 99.50%-99.98% and balances the distribution of traffic load in the faulty network.

Index Terms—Ant Colony Optimization, Network-on-Chip, Fault-tolerant Routing, Fault-aware Mechanism

1. INTRODUCTION

For Multiprocessor System-on-Chip (MPSoC), Network-on-Chip (NoC) provides a flexible, reliable, and scalable on-chip communication architecture with the advantages of low latency and high throughput [1]. However, as semiconductor technology advances, the density of on-chip components increases, and defective transistors and interconnect failures become more serious [2]. Moreover, as the system scales up, on-chip components have a higher chance of failing. These failures cause unbalanced traffic load and even system crashes. Thus, fault-tolerant approaches are critical for building reliable systems and increasing the product yield.

In recent years, many NoC fault-tolerant schemes have been proposed [2]. These schemes consist of a fault detection mechanism and a fault-tolerant routing algorithm, as shown in Fig. 1. First, the fault detection mechanism uses dedicated circuits to detect, locate, and isolate the faulty parts of routers [3]. Then, the fault-tolerant routing algorithms [4-7] detour around the faulty routers based on the fault information while keeping packet transmission complete. However, the resulting low path diversity causes unbalanced network traffic. That is, traffic congests around the faulty node, and the system performance degrades drastically. Hence, to further improve system performance, fault-tolerant routing algorithms need to be designed with higher path diversity and traffic-balancing ability.

ACO-based adaptive routing was proposed for NoC traffic balancing in [8]. ACO is a bio-inspired algorithm that mimics how an ant colony finds a shorter path from the nest to food. It reduces traffic congestion by using ant pheromone as historical network information and thereby balances traffic effectively.

According to [9], an ant colony also shows fault-tolerant behavior. It consists of three steps: 1) encounter the obstacle, 2) search for available paths, and 3) select the better path. By regarding faulty routers as obstacles and packets as ants, the same situation arises in the network. Therefore, inspired by this behavior of ants, we propose an ACO-based fault-aware routing algorithm (ACO-FAR) with three techniques. The main contributions of this paper include the following:

1) Notification mechanism of fault information: We propose a low-cost dynamic notification mechanism to effectively propagate fault information in the network.

2) ACO-based fault-aware path searching mechanism: To provide as much path diversity as possible, this mechanism searches for all possible paths to neighboring nodes except for faulty paths.

3) ACO-based fault-aware path selecting mechanism: To relieve the traffic congestion around the faulty nodes, this mechanism is aware of faulty nodes and selects the better path with ant pheromone. In addition, such fault awareness makes the system performance degrade gracefully.

Figure 1. Fault-tolerant scheme in NoC system.

2013 IEEE Workshop on Signal Processing Systems

978-1-4673-6238-2/13 $31.00 © 2013 IEEE

2. RELATED WORKS

In recent years, many fault-tolerant routing algorithms have been proposed to deal with faulty routers in NoC. Generally, there are two main categories of these algorithms: 1) turn model based [4-6] and 2) virtual channels (VCs) [7] based fault-tolerant routing algorithms.

2.1 Turn Model Based Fault-tolerant Routing

The turn model based fault-tolerant routing algorithms [4-6] place restrictions on the routing function. The restrictions prohibit particular turns next to the faulty region. In addition, these methods may make assumptions about the location of faults. As a result, these routing algorithms provide robustness only under certain circumstances. Moreover, the system performance degrades drastically due to lower path diversity and traffic imbalance. In contrast, an ACO-based fully adaptive routing has high potential to efficiently divert traffic to less congested areas and improve performance.

2.2 Virtual Channels Based Fault-tolerant Routing

The VCs based fault-tolerant routing algorithms [7] relax part of the routing restrictions of the turn model based approaches by using VCs, which allow multiple transactions to share a single physical channel through time multiplexing. However, for a resource-limited NoC system, the hardware cost of routers is a critical issue. Due to the high area cost and power consumption of VCs, VCs based routing is unacceptable in a resource-limited NoC. Therefore, in this paper, we propose an ACO-based routing without VCs to achieve fault tolerance.

3. PROPOSED ROUTING ALGORITHM

According to [9], the fault-tolerant behavior of an ant colony consists of three steps: 1) encounter, 2) search, and 3) select. In Fig. 2(a), when the obstacle appears, ants on the pheromone trails first encounter it and cannot move forward directly. Then, they search for other available paths in random directions to detour around the obstacle. After a short period, the shorter path accumulates more pheromone. Finally, ants select the better path to pass through.

3.1 ACO-based Fault-Aware Routing (ACO-FAR)

By regarding faulty routers as obstacles and packets as ants, a similar situation arises in the network. With this assumption, we propose three schemes that follow the ants' three-step behavior to improve fault-tolerance ability and combine tightly with ACO-based adaptive routing [8].

In general, adaptive routing determines the suitable output channel for each packet based on network status. It consists of a routing function and a selection function. The routing function gives a set of candidate channels, and the selection function chooses one proper output channel based on network information, such as output queue length. We modify this routing scheme by adding fault information to achieve fault tolerance. The path searching mechanism and the path selecting mechanism correspond to the routing function and the selection function, respectively. They route packets to a proper output direction using the fault information in the fault-awareness process. The proposed routing process is shown in Fig. 2(b) and discussed below.

3.1.1 Notification mechanism of fault information

First of all, in order to add the fault information to the routing algorithm, we propose a mechanism that collects and propagates the information of faults. This mechanism makes the fault-aware routing decision more efficient. A fault detection mechanism is generally implemented in the router to locate the faulty parts of the router. The faulty routers are disabled from transmitting packets. These routers then send Fault Regional Index (FRI) signals to the neighboring routers.
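As a rough illustration of how such a notification signal could spread, the sketch below floods FRI values outward from faulty routers on a 2D mesh. It assumes the 2-bit FRI of this section's 8×8 example: the value is 3 next to a fault, drops by one per further hop, and each healthy router keeps the maximum value it receives. The function name `propagate_fri` and the coordinate representation are hypothetical and only serve the example; the real mechanism is a hardware signal between routers, not a software search.

```python
from collections import deque

FRI_MAX = 3  # maximum value of a 2-bit signal (assumed FRI width)

def propagate_fri(width, height, faulty):
    """Return {router: largest FRI value received from any neighbor}.

    faulty is a set of (x, y) coordinates of faulty routers.
    """
    received = {(x, y): 0 for x in range(width) for y in range(height)}
    # Each queue entry is (router, value that router drives on its outputs).
    queue = deque((f, FRI_MAX) for f in faulty)
    while queue:
        (x, y), out = queue.popleft()
        for n in ((x, y + 1), (x + 1, y), (x, y - 1), (x - 1, y)):
            # Skip positions outside the mesh and faulty (disabled) routers.
            if n not in received or n in faulty:
                continue
            if out > received[n]:
                received[n] = out
                # A healthy router forwards (max received value) - 1,
                # and stops forwarding once that would be zero.
                if out - 1 > 0:
                    queue.append((n, out - 1))
    return received
```

With a single fault at (3, 3) in an 8×8 mesh, routers one, two, and three hops away receive 3, 2, and 1 respectively, and routers four or more hops away see 0.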

Due to the limited resources in the NoC system, we implement the FRI as a local signal to minimize the cost of providing fault information. We also make the FRI adjustable and scalable for different network sizes. We set the FRI as an n-bit local signal, where n is a value that depends on the network size.

Figure 2. (a) Fault-tolerant behavior of an ant colony. (b) Corresponding routing process of the proposed algorithm.

Figure 3. (a) The propagation of FRI. (b) The corresponding reaction of receiving different values of FRI.

For example, in Fig. 3(a), we assign the FRI as a 2-bit local signal for an 8×8 mesh NoC. When a fault is detected by the fault detection mechanism, the FRI signals from the faulty router to its adjacent routers are set to 3, which is the maximum value of a 2-bit signal. This value decreases by one at each further hop until it reaches zero. This means that the fault information can propagate at most 4 hops away. Moreover, when there are multiple faults, the FRI value is obtained from the maximum of the FRI signals received from adjacent routers in order to evaluate the worst-case fault condition, as shown in (1), where FRI_N, FRI_E, FRI_S, and FRI_W are the values received from the four adjacent routers.

\[
FRI_{out} =
\begin{cases}
3, & \text{if the router is faulty} \\
\max(FRI_N, FRI_E, FRI_S, FRI_W) - 1, & \text{otherwise}
\end{cases}
\tag{1}
\]

The path searching mechanism and the path selecting mechanism react according to the FRI value received, as listed in Fig. 3(b). Hence, the FRI mechanism can steer the traffic load away from the faulty node and reduce the congestion of nearby routers. This greatly alleviates the problem of performance degradation.

3.1.2 ACO-based fault-aware path searching mechanism

When receiving FRI signals from neighboring routers, the router can identify whether its neighboring routers are normal or faulty. The path searching mechanism can then provide appropriate candidate output channels to the path selecting mechanism.

The path searching mechanism searches for all possible paths to adjacent nodes except for faulty paths to provide higher path diversity. Three common cases, shown in Fig. 4, illustrate this process:

1) Case I: A packet at the source node is sent to the destination node. The path searching mechanism provides fully-minimal paths (i.e., North and East) as candidate channels since the faulty node is not adjacent to the source node (the received FRI does not equal 3).

2) Case II: The situation is similar to Case I except that the faulty node is adjacent to the node sending the packet (the received FRI equals 3). As a result, the candidate channel provided by the path searching mechanism is North, and the pheromone of the East channel is set to zero.

3) Case III: The only minimal path from source to destination is blocked by the faulty node, so there are no possible minimal paths to transmit on. Hence, the path searching mechanism provides non-minimal paths (i.e., North and South) as candidate channels instead of interrupting the packet transmission. This situation is called packet detour and is similar to the search behavior of ants.

Figure 4. Illustrations of path searching.

Regarding the deadlock issue caused by the use of fully-minimal paths and packet detour, we also adopt the schemes in ACO-based Deadlock-Aware Routing (ACO-DAR) [10], which greatly suppress the occurrence of deadlock while the area overhead is minor.

3.1.3 ACO-based fault-aware path selecting mechanism

With the pheromone table of ACO, the path selecting mechanism chooses the better output channel from the candidate output channels provided by the path searching mechanism. The pheromone table is constructed by ant packets. The ant packets collect the network information in the routing process and update the table by the state transition rule, as shown in (2). The normalized pheromone Ph'(j,d) can be regarded as the probability of selecting channel index j (North, East, South, and West) as the direction of the packet transmission to destination index d. L_j is proportional to the inverse of the length of the input buffer at channel j; N_k is the number of channels of the current router k; and α is the weighting coefficient between the current and the historical information of the network, which ranges from zero to one. β is the fault penalty factor.

\[
Ph'(j, R_i) = \left[ (1-\alpha) \times Ph(j, R_i) + \alpha \times \frac{L_j}{N_k - 1} \right] \times \beta
\tag{2}
\]

To select the better output channel in the faulty network, the fault information, FRI, is taken into consideration to reduce the probability of selecting a path toward the faulty region. Since the fault may cause its nearby region to be congested, an output channel with a higher value of FRI represents a limitation on the path diversity. Therefore, the fault penalty β is introduced into the state transition rule when making the selection decision, but it does not alter the pheromone table. According to (3), the value of β is decided by the value of FRI, and it is implemented with exponential decay. This is hardware-friendly. Note that α is constant, so the state transition rule can be implemented with a constant multiplier or even a barrel shifter. Furthermore, by reusing the existing hardware of Regional ACO-based routing [11], which reduces the cost of the pheromone table by about 90%, the area overhead remains minor, since only a penalty factor is added.

\[
\beta = 2^{-FRI}
\tag{3}
\]

In summary, the flow chart is shown in Fig. 5, and the routing process is activated when a head flit arrives in the input buffers. The process can finish in one cycle. First, the router receives the FRI transmitted from the adjacent nodes. Second, the path searching mechanism uses the information to determine whether the packet would be blocked by the faulty node. If there are no minimal ways to route, the non-minimal paths are added. Finally, the path selecting mechanism selects an output channel from those candidate channels.
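The routing flow summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' hardware implementation: the names `search_paths`, `fault_penalty`, and `select_path` are hypothetical, a received FRI of 3 is taken to mean the neighbor in that direction is faulty (2-bit FRI), the detour rule excludes the direction opposite to a blocked minimal direction so that it mirrors Case III, and the selection simply takes the candidate with the largest pheromone after applying β = 2^(-FRI) from (3).

```python
DIRECTIONS = ("N", "E", "S", "W")
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}
FRI_FAULTY = 3  # received FRI of 3 marks an adjacent faulty router (assumed 2-bit FRI)

def search_paths(minimal, fri):
    """Path searching: return the candidate output channels.

    minimal -- directions that lie on a minimal path to the destination
    fri     -- received FRI value per direction, e.g. {"N": 0, "E": 3, ...}
    """
    candidates = [d for d in minimal if fri[d] < FRI_FAULTY]
    if candidates:
        return candidates
    # Packet detour: every minimal channel is blocked, so offer healthy
    # non-minimal channels. Excluding the direction opposite to a minimal
    # one is an assumption made to reproduce Case III (North and South).
    return [d for d in DIRECTIONS
            if d not in minimal
            and OPPOSITE[d] not in minimal
            and fri[d] < FRI_FAULTY]

def fault_penalty(fri_value):
    """Equation (3): beta = 2^(-FRI)."""
    return 2.0 ** (-fri_value)

def select_path(candidates, pheromone, fri):
    """Path selecting: choose the candidate with the largest penalized pheromone.

    The penalty is applied only to the decision; the pheromone table
    passed in is left unchanged.
    """
    return max(candidates, key=lambda d: pheromone[d] * fault_penalty(fri[d]))
```

For instance, with pheromone 0.4 on North and 0.6 on East, East wins when both FRI values are 0, but North wins once East reports an FRI of 2, because the East score is scaled by 1/4. Since β is a power of two, the scaling corresponds to a right shift in hardware, which is why the text calls it hardware-friendly.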

4. PERFORMANCE EVALUATIONS

4.1 Simulation Environment and Setup

The experiments are evaluated with the SystemC NoC simulator Noxim [12]. The network topology is an 8×8 mesh. The packet length is 8 flits, and each router has 4 input buffers with a depth of 4 flits. For the traffic pattern, we use uniform traffic and multimedia system (MMS) traffic [13]. In uniform traffic, each packet is sent to a randomly chosen destination with equal probability. In MMS traffic, we map and schedule 40 video/audio tasks on 25 IPs in a 5×5 mesh as the realistic traffic. The simulation time is 20,000 cycles, and the first 10,000 cycles are the warm-up time so that performance is measured on a steady network. The performance index is the average latency under different packet injection rates. Moreover, we also adopt the saturation throughput [14], which is the throughput at which the average latency equals twice the zero-load latency, as our evaluation metric.

4.2 Performance Analysis of ACO-FAR

The first simulation compares the performance of Modified X-First [4], FADyAD [5], Gradient [6], and ACO-FAR, as shown in Fig. 6. The compared algorithms [4-6] are turn model based fault-tolerant routing algorithms, which provide lower path diversity. The traffic pattern is uniform, and each simulation has a different number of faulty routers. Note that Modified X-First routing can only handle a single fault, so it is excluded from the simulations with multiple faulty nodes. In Fig. 6(a), there is one faulty router at the center of the mesh network, and the improvement of ACO-FAR over the related works is 33.3%-77.7% in saturation throughput, which conforms to the previous discussion. In Fig. 6(b), there are 2 faulty nodes, and the improvement is 41.7%-54.5%. In Fig. 6(c), 4 faulty nodes are located around the center region of the mesh, and the improvement is 40.0%-55.5%.

The second simulation is the performance comparison under MMS traffic, which is the realistic traffic of a multimedia system including an H.263 video encoder, an H.263 video decoder, an MP3 audio encoder, and an MP3 audio decoder. The result is shown in Fig. 7: the improvement in saturation throughput is 12.5%-28.6%.
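The saturation-throughput metric defined in Section 4.1 and the relative improvements quoted in this section can be computed from a measured latency curve along the following lines. This is only a sketch: the helper names are hypothetical, linear interpolation between measured points is an assumption about post-processing rather than something the simulator provides, and the numbers in the usage note are made up for illustration, not results from the paper.

```python
def saturation_throughput(points, zero_load_latency):
    """Throughput at which average latency reaches twice the zero-load latency.

    points -- list of (throughput, average_latency) pairs sorted by throughput
    Returns None if the curve never reaches the target latency.
    """
    target = 2.0 * zero_load_latency
    for (t0, l0), (t1, l1) in zip(points, points[1:]):
        if l0 <= target <= l1 and l1 > l0:
            # Linear interpolation between the two surrounding measurements.
            return t0 + (target - l0) * (t1 - t0) / (l1 - l0)
    return None

def improvement_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline
```

As a usage example with invented data, a curve of `[(0.05, 20), (0.10, 24), (0.15, 40), (0.20, 120)]` and a zero-load latency of 20 cycles gives a saturation throughput of about 0.15, and a saturation throughput of 0.16 against a baseline of 0.12 corresponds to an improvement of about 33.3%.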

Figure 5. Flow chart of ACO-based fault-aware routing.

Figure 6. Performance of fault-tolerant routing algorithms under uniform traffic with (a) 1 fault. (b) 2 faults. (c) 4 faults.

Figure 7. Performance of fault-tolerant routing algorithms under MMS traffic with 1 fault.

4.3 Evaluation of Fault-Tolerance Ability

To evaluate the fault-tolerance ability, the index we use in the simulation is the unreachable packet ratio, which is the number of unreachable packets divided by the total number of packets injected into the network. An unreachable packet is a packet that is blocked by faulty nodes or traffic congestion and thus cannot reach its destination for a long period. The simulation traffic is uniform, and the network is an 8×8 mesh with 1, 2, and 4 faulty routers. The compared algorithms are the same works.

TABLE 1. UNREACHABLE PACKET RATIO.

In Table 1, Modified X-First has the highest unreachable packet ratio and can only tolerate a single faulty router. Gradient has a relatively lower unreachable packet ratio than FADyAD because it considers the relation between the destination node and the faulty node to prevent packet blockage. The restrictions of these algorithms still limit the path diversity and thus weaken their fault-tolerance ability. In contrast, ACO-FAR has a much better ability to deliver packets in the faulty network. Its path searching mechanism provides higher path diversity to tolerate the fault, and the path selecting mechanism with fault awareness avoids routing toward the congested region of the faulty node.

The other simulation for evaluating the fault-tolerance ability is the statistical traffic load distribution. It is the result of sending the same number of packets with the packet injection rate at the saturation throughput. The simulation traffic is uniform. The more packets a router routes, the heavier its traffic load. The result is shown in Fig. 8. Compared with Gradient, which performs relatively better than the other works in the previous simulation, the traffic load of ACO-FAR is also more balanced around the faulty node.

Figure 8. The statistical traffic load at the saturation throughput of (a) ACO-FAR. (b) Gradient.

5. CONCLUSIONS

In this paper, we propose ACO-FAR, which is biologically inspired by the fault-tolerant behavior of ants, to achieve fault tolerance for NoC, and its area overhead is minor. With the proposed algorithm and flow, our simulations show that the improvements in saturation throughput of ACO-FAR over other related fault-tolerant routing algorithms are 12.5%-77.7%. Moreover, ACO-FAR improves the reachable packet ratio to 99.50%-99.98% and balances the distribution of traffic load in the faulty network.

6. ACKNOWLEDGEMENT

This work was supported in part by the National Science Council under NSC 100-2221-E-002-091-MY3.

7. REFERENCES

[1] L. Benini and G.D. Micheli, “Network on chip: a new paradigm for systems on chip design,” in Proc. IEEE Conf. on DATE, pp. 418-419, 2002.

[2] M. Radetzki, C. Feng, X. Zhao, and A. Jantsch, “Methods for fault tolerance in network on chip,” ACM Computing Surveys, vol. 44, pp. 1-36, Jan. 2013.

[3] S.Y. Lin, W.C. Shen, C.C. Hsu, C.H. Chao, and A.Y. Wu, “Fault-tolerant router with built-in self-test/self-diagnosis and fault-isolation circuits for 2D-mesh based chip multiprocessor systems,” in Proc. IEEE Conf. on VLSI-DAT, pp. 72-75, April 2009.

[4] Z. Zhang, A. Greiner, and S. Taktak, “A reconfigurable routing algorithm for a fault-tolerant 2D-Mesh Network-on-Chip,” in Proc. ACM/IEEE Conf. on DAC, pp. 441-446, June 2008.

[5] A. Mehranzadeh, A. Khademzadeh, and A. Mehran, “FADyAD - Fault and congestion Aware Routing Algorithm Based on DyAD Algorithm,” in Proc. IEEE Conf. on IST, pp. 274-279, Dec. 2010.

[6] I. Pratomo and S. Pillement, “Gradient - An adaptive fault-tolerant routing algorithm for 2D mesh Network-on-Chips,” in Proc. DASIP, pp. 1-8, Oct. 2012.

[7] S. Pasricha and Y. Zou, “NS-FTR: A fault tolerant routing scheme for networks on chip with permanent and runtime intermittent faults,” in Proc. ASP-DAC, pp. 443-448, Jan. 2011.

[8] M. Daneshtalab and A. Sobhani, “NoC hot spot minimization using antnet dynamic routing algorithm,” in Proc. IEEE Conf. on ASAP, pp. 33-38, 2006.

[9] R. Beckers, J.L. Deneubourg, and S. Goss, “Trails and U-turns in the selection of the shortest path by the ant Lasius niger,” Journal of Theoretical Biology, vol. 159, pp. 397-415, 1992.

[10] K.Y. Su, H.K. Hsin, E.J. Chang, and A.Y. Wu, “ACO-based deadlock-aware fully-adaptive routing in network-on-chip systems,” in Proc. IEEE Workshop on SiPS, pp. 209-214, Oct. 2012.

[11] H.K. Hsin, E.J. Chang, C.H. Chao, and A.Y. Wu, “Regional ACO-based routing for load-balancing in NoC systems,” in Proc. IEEE Second World Congress on NaBIC, pp. 370-376, Dec. 2010.

[12] “Noxim: the network-on-chip simulator,” http://sourceforge.net/projects/noxim, 2008.

[13] G. Ascia, V. Catania, and M. Palesi, “Multi-objective mapping for mesh-based NoC architectures,” in Proc. IEEE Conf. on Hardware/Software Codesign and System Synthesis, pp. 182-187, Sept. 2004.

[14] W.J. Dally and B. Towles, “Principles and practices of interconnection networks,” Morgan Kaufmann, 2004.