

Microelectronics Reliability 54 (2014) 654–661


journal homepage: www.elsevier.com/locate/microrel

A finite state machine based fault tolerance technique for sequential circuits

Aiman H. El-Maleh a,⇑, Ayed S. Al-Qahtani b

a Department of Computer Engineering, King Fahad University for Petroleum and Minerals, Dhahran, Saudi Arabia
b Department of Computer Engineering, King Saud University, Riyadh, Saudi Arabia

⇑ Corresponding author. E-mail addresses: [email protected] (A.H. El-Maleh), [email protected] (A.S. Al-Qahtani).

0026-2714/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.microrel.2013.10.022

Article info

Article history:
Received 4 August 2012
Received in revised form 24 October 2013
Accepted 25 October 2013
Available online 25 November 2013

Abstract

With technology advancing at the nanometer scale, systems have become more prone to manufacturing defects and more susceptible to soft errors. Soft errors induced by ion particles are no longer limited to specific fields such as aerospace applications. This raises the challenge of devising techniques that tackle soft errors in both combinational and sequential circuits. In this work, we propose a finite state machine (FSM) based fault tolerance technique for sequential circuits. The proposed technique adds redundant equivalent states to protect a few states with high probability of occurrence. The added states guarantee that all single faults occurring in the state variables of highly occurring states, or in their combinational logic, are tolerated. The proposed technique has minimal area overhead, as only a few states need protection.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

The probability of failure of digital systems grows in direct proportion to Moore's law [1]. Continuous improvements in CMOS technology entering the nanometer scale have resulted in quantum mechanical effects, creating many technological challenges for further scaling of CMOS devices. This has led to the exploration of new technologies for circuit design. Nanotechnology-based fabrication is expected to offer the extra density and potential performance to take electronic circuits to the next step. It is estimated that molecular electronics can achieve very high densities ($10^{12}$ devices per cm$^2$) and operate at very high frequencies (of the order of THz) [2]. Nano-scale devices are limited by higher defect rates and increased susceptibility to soft errors. The reduced noise tolerance of these devices is responsible for inducing device malfunctions by external influences such as EMI, thermal perturbations and cosmic radiation.

Temporal transient faults (soft errors) can hit either the combinational logic or the flip-flops of a sequential circuit. If the error occurs in the combinational logic, it results in a Single Event Transient (SET). If it occurs in the memory cell itself, it results in a Single Event Upset (SEU). Both SET and SEU have major implications for sequential circuits and should receive proper treatment. Transient faults (SET/SEU) are mainly caused by ion movement through the materials of ICs. With feature sizes reaching below 0.35 µm, SET and SEU faults are no longer considered a small attenuation. Instead, they will be considered as normal circuit signals. The Soft Error Rate (SER) will grow in direct proportion to the number of cells in the design and with reduction in voltage [3].

Fortunately, there are some masking properties that prevent transient faults from affecting sequential circuits, namely logical masking, electrical masking and latching window masking. Logical masking prevents the SET from propagating from the fault location to a circuit output due to the logic of the circuit. For example, a 2-input AND gate masks any fault on one input if the other input has a "0" value. Electrical masking attenuates or completely masks the SET signal due to the electrical properties of gates. Latching window masking occurs when the transient pulse arrives outside the latching window of the memory element. However, the reduction in feature sizes limits the effect of electrical and latching window masking.
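The AND-gate case of logical masking can be checked directly. The following is a minimal sketch; the helper names `and_gate` and `fault_is_masked` are hypothetical and only serve the illustration.

```python
def and_gate(a: int, b: int) -> int:
    """2-input AND gate on single-bit values."""
    return a & b


def fault_is_masked(a: int, b: int) -> bool:
    """True if flipping input `a` leaves the AND output unchanged."""
    return and_gate(a, b) == and_gate(a ^ 1, b)


# Enumerate all input combinations: the fault on `a` is masked exactly when b == 0.
for a in (0, 1):
    for b in (0, 1):
        print(f"a={a} b={b} masked={fault_is_masked(a, b)}")
```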

In order to overcome the effect of soft errors in sequential circuits, several techniques have been proposed in the literature. Triple Modular Redundancy (TMR) is one of the well-known techniques for reducing the impact of soft errors in combinational logic. In TMR, three identical modules perform the same operation, and a voter accepts the outputs of all three modules and produces a majority vote at its output. If a single voter is used, that voter becomes a critical point of failure and the reliability of the TMR structure is limited by that of the final arbitration unit (i.e., the voter). Despite this limitation, TMR is heavily used in practice whenever circuit reliability is crucial, especially when protection against single faults is needed. In [4], a TMR hardening technique addressing SEU faults is proposed based on a temporally redundant sampling latch.
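The voting step in TMR can be sketched as a bitwise 2-out-of-3 majority function. This is an illustrative sketch only; the function name `majority` and the example values are assumptions and do not model the voter hardware of any cited work.

```python
def majority(x: int, y: int, z: int) -> int:
    """Bitwise 2-out-of-3 majority vote over three module outputs."""
    return (x & y) | (y & z) | (x & z)


correct = 0b1011
faulty = correct ^ 0b0100  # one module with a single flipped bit

# A single faulty module is outvoted by the two correct ones.
print(majority(correct, faulty, correct) == correct)
```

If two modules fail on the same bit, the vote follows the faulty value, which is why TMR is usually described as protecting against single faults.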


In [5], it is concluded that 49% of the overall soft errors for a design manufactured using state-of-the-art technologies result from sequential elements. Hence, many proposed techniques are based on latch hardening, which provides tolerance for faults occurring in memory elements. Examples of these techniques are given in [6–9].

The work in [10] proposes the use of duplication with self-checking for one-hot encoded FSMs. This approach ensures that any single error will not lead to an incorrect next state. However, it might lead to an erroneous output. The area overhead of this approach is considered large, as two sets of selectors are used and the size of each selector set is equal to the number of flip-flops, which is equal to the number of states. The work in [11] addresses the fault tolerance of FSMs based on N-modular redundancy by adding N voters instead of one voter to tolerate errors occurring in the voters. It also guarantees that any single error will not lead to an incorrect next state. However, it might lead to an erroneous output. In [12], it is shown that for a given ability to tolerate faults, fault tolerance based on replication yields better circuit reliability than fault tolerance based on error correcting codes. However, for a given area complexity, error correcting codes based on orthogonizable codes provide better reliability than replication. In [13], a novel reduced m-out-of-n coding method is proposed that can be used for the synthesis of a totally self-checking checker for tolerating soft errors. The authors propose totally self-checking synchronous sequential circuits that are able to recover after the occurrence of a fault.

Rollback recovery is one of the techniques that can be used for fault tolerance. It brings the circuit back several cycles to a state it has reached in the past, and it requires storing the state of the circuit at some cycle boundaries. In [14], a model that deals with Multiple Bit Upsets (MBUs) is proposed to study the impact of MBUs on the reliability of rollback recovery circuits.

The objective of this work is to investigate the design of fault tolerant sequential circuits based on adding redundant states at the state diagram level. The redundant states are added to guarantee tolerance of all single soft errors in states with high probability of occurrence. Since the steady state probabilities of several sequential benchmark circuits have significant variance, our method takes advantage of this and chooses some of the states with high probability of occurrence for protection. This way, the area overhead is kept minimal.

The rest of the paper is organized as follows. Section 2 describes the proposed FSM-based fault tolerance technique. Section 3 describes the simulation framework used to evaluate circuit failure probability and reliability. Section 4 presents the results of the proposed technique as well as a comparison with existing fault tolerance techniques in terms of both failure rate and area. The paper concludes with Section 5.

2. Proposed finite state machine based fault tolerance technique

In this section, we introduce a novel idea for increasing sequential circuit reliability based on adding redundant equivalent states to the states with high probability of occurrence. By protecting states with high probability of occurrence, the reliability of a sequential circuit is increased, as it is very likely that an error occurs when the circuit is either in one of these states or going to one of them. The newly added redundant states are equivalent to the protected states and are assigned the same next state and output.

2.1. Type of states in fault tolerant sequential circuit

States in a fault tolerant sequential circuit can be classified into three types: normal states, protected states, and redundant states. Normal states are original states that are not protected for fault tolerance. Protected states are states that will be considered for reliability enhancement due to their high probability of occurrence. For each protected state, equivalent redundant states will be added to guarantee single soft fault tolerance. Fig. 1 illustrates an example that shows the different types of states. There are 3 original states in Fig. 1; 2 of them are protected and 1 is normal. Each of the 2 protected states has 4 redundant equivalent states. The number of redundant states added is the minimum number required to satisfy fault tolerance, as discussed in Section 2.3.

2.2. State probability calculation

In order to decide which of the original states of a finite state machine need to be protected, states must be sorted based on their probability of occurrence. Probabilistic approaches try to correlate the various probabilities in order to calculate the steady state probabilities that would be obtained if the FSM were simulated for an infinite amount of time.

Logic synthesis [15], verification [16], testing [17,18] and low-power design [19–21] have benefited from using probabilistic techniques. In particular, the behavior of FSMs has been investigated using concepts from Markov chain (MC) theory, where steady-state and transition probabilities are estimated for large FSMs [19,22]. This can also be achieved by repeated application of the Chapman–Kolmogorov equations [23,24].

In our approach, we used a statistical method: the state machine is simulated using random input vectors and the state probabilities are determined from the simulation. Each FSM is simulated for a sequence of 250,000 random vectors, and a counter is recorded for each state. The resulting steady state probability of each state is calculated by dividing its counter by 250,000.
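The counting procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions: the FSM is represented by a hypothetical `next_state(state, vector)` callable, the toy `bit_flipper` machine and the function name are illustrative, and the vector count is reduced from 250,000 to keep the example fast.

```python
import random
from collections import Counter


def estimate_state_probabilities(next_state, initial_state, num_inputs,
                                 num_vectors=250_000, seed=0):
    """Estimate state probabilities by simulating random input vectors."""
    rng = random.Random(seed)
    counts = Counter()
    state = initial_state
    for _ in range(num_vectors):
        counts[state] += 1                       # record the visited state
        vector = rng.getrandbits(num_inputs)     # one random input vector
        state = next_state(state, vector)
    return {s: c / num_vectors for s, c in counts.items()}


def bit_flipper(state, vector):
    """Toy two-state FSM: input 1 toggles between A and B, input 0 holds."""
    if vector & 1:
        return "B" if state == "A" else "A"
    return state


probs = estimate_state_probabilities(bit_flipper, "A", num_inputs=1,
                                     num_vectors=10_000)
print(probs)
```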

Table 1 shows several MCNC/LGSynth FSM benchmarks [25] along with the number of states that need to be protected to achieve coverage of 50% and 90%, respectively. The results shown in the table demonstrate that, for most circuits, only a few states need to be protected to enhance their fault tolerance. For most circuits, fewer than 25% of the states account for nearly 90% of the overall probability of occurrence.
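Given estimated state probabilities, the number of states to protect for a target coverage can be found by accumulating the most probable states first. The sketch below assumes this greedy reading of "coverage"; the function name and the probability values are made up for illustration and are not taken from Table 1.

```python
def states_for_coverage(probabilities: dict, target: float) -> list:
    """Pick the most probable states until their total probability reaches `target`."""
    chosen, covered = [], 0.0
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    for state, p in ranked:
        if covered >= target:
            break
        chosen.append(state)
        covered += p
    return chosen


# Hypothetical state probabilities for a 4-state machine.
example = {"s0": 0.55, "s1": 0.30, "s2": 0.10, "s3": 0.05}
print(states_for_coverage(example, 0.50))
print(states_for_coverage(example, 0.90))
```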

The FSMs for lion9, pma and train11 are incompletely specified. To ensure correct analysis by our simulation environment, we have modified them to become completely specified by making the unspecified transitions go to the first state in the machine.

2.3. State redundancy based fault tolerance

In order to describe the proposed state redundancy based fault tolerance technique, we use an illustrative example. Then, the criteria for determining the minimum number of bits needed to encode the fault tolerant FSM are established. After that, the proposed fault tolerance minimum state encoding algorithm is presented.

2.3.1. An illustrative example

Sequential circuit reliability can be increased by adding redundant equivalent states to the FSM. The redundant states have the same input, next state, and output as the protected states. To illustrate this, consider the simple bit flipper example shown in Fig. 2. Given an FSM with two states A and B, let us assume that both states need to be protected. Originally, the FSM needs only one D-FF to be implemented. To tackle single errors, we add redundant equivalent states to A (A_rd0, A_rd1 and A_rd2) and to B (B_rd0, B_rd1 and B_rd2) with the same input, output and next state. For each row in the original state table, additional redundant rows are added in the protected state table, as shown in Table 2.

In order to detect all single errors, the Hamming distance between state A and its redundant states should be 1. Similarly for state B, the Hamming distance between it and any of its redundant states should be 1. Therefore, the distance between states A and B must be kept at least 3 to ensure that the equivalent states of A do not overlap with the equivalent states of B.

Fig. 1. An example of state types in a fault tolerant sequential circuit.

Table 1. MCNC/LGSynth FSM benchmarks steady state probability analysis.

Circuit    Inputs  Outputs  States  Protected states     Protected states
                                    for 50% coverage     for 90% coverage
bbara      4       2        10      3                    5
bbsse      7       7        13      2                    3
cse        7       7        16      1                    2
keyb       7       2        19      1                    3
lion9      2       1        9       2                    3
pma        8       8        24      2                    5
s832       18      19       25      1                    3
s1494      8       19       48      1                    2
styr       9       10       30      1                    3
train11    2       1        11      2                    6
dk14       3       5        7       3                    4

Fig. 2. A bit flipper example.

Table 2. State table of the bit flipper example before and after protection.

Original state table        Protected state table
I   PS     NS     O         I   PS            NS       O
0   A(0)   A(0)   0         0   A (000)       A(000)   0
                            0   A_rd0 (100)   A(000)   0
                            0   A_rd1 (010)   A(000)   0
                            0   A_rd2 (001)   A(000)   0
0   B(1)   B(1)   1         0   B (111)       B(111)   1
                            0   B_rd0 (011)   B(111)   1
                            0   B_rd1 (101)   B(111)   1
                            0   B_rd2 (110)   B(111)   1
1   A(0)   B(1)   0         1   A (000)       B(111)   0
                            1   A_rd0 (100)   B(111)   0
                            1   A_rd1 (010)   B(111)   0
                            1   A_rd2 (001)   B(111)   0
1   B(1)   A(0)   1         1   B (111)       A(000)   1
                            1   B_rd0 (011)   A(000)   1
                            1   B_rd1 (101)   A(000)   1
                            1   B_rd2 (110)   A(000)   1

Fig. 3. Fault tolerant bit flipper states and their encoding with two protected states.

Errors can happen either when the FSM is in a state (i.e., in one of the FFs) or when it is going to a state (i.e., in the combinational logic). Any single bit error that occurs while being in, or during a transition to, state A or B will lead to an equivalent redundant state. Fig. 3 shows the states and their encoding for the bit flipper example after adding the redundant states. In Fig. 3, the minimum possible number of bits to represent all protected and redundant states is 3. This is why we need to add 3 redundant states to each protected state to cover all single errors in the state encoding bits. Consequently, we need 3 D-FFs to implement this circuit. If we use fewer than 3 D-FFs, 2 or more redundant states will have the same code.
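The claim that any single bit error in the protected bit flipper lands in an equivalent state can be checked exhaustively. The sketch below assumes the encodings of Table 2 (A = 000, B = 111) and a Moore-style output (0 in A, 1 in B); the helper names are hypothetical.

```python
A, B = 0b000, 0b111   # codes of the protected states, as in Table 2
BITS = 3


def single_bit_flips(code, bits=BITS):
    """All codes at Hamming distance 1 from `code`."""
    return [code ^ (1 << i) for i in range(bits)]


def state_class(code):
    """Return 'A' or 'B' for a protected code or any of its redundant codes."""
    if code == A or code in single_bit_flips(A):
        return "A"
    return "B"  # in 3 bits, every remaining code is B or one of B's redundant codes


def step(code, inp):
    """Next-state code and output of the protected bit flipper."""
    cls = state_class(code)
    output = 0 if cls == "A" else 1
    if inp == 1:
        cls = "B" if cls == "A" else "A"
    return (A if cls == "A" else B), output


# Every single-bit upset of A or B behaves exactly like the fault-free state.
tolerated = all(
    step(faulty, inp) == step(good, inp)
    for good in (A, B)
    for faulty in single_bit_flips(good)
    for inp in (0, 1)
)
print(tolerated)
```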

Moreover, SET faults occurring in the combinational logic are protected against by not sharing logic among the combinational logic blocks implementing the next state equations of the memory elements or flip-flops. This is done by partitioning the combinational logic into the input cones of each D-FF so that no single error can propagate to more than one D-FF.

It is worth mentioning that if any of the outputs of the protected states is a don't care, that don't care needs to be specified before synthesis to ensure that the protected state and its equivalent states remain equivalent after synthesis.

In addition to fault tolerance against SET and SEU, multiple faults can be tolerated if they happen in the same flip-flop cone, or if they result in a single fault after masking. Also, faults (i.e., SEU or SET) that occur in consecutive cycles are tolerated if an SEU (SET) occurring in one cycle is followed by an SEU (SET) in the next cycle. They are also tolerated when an SEU occurring in one cycle is followed by a SET in the next cycle. This is because such faults will cause the sequential circuit to be in one of the equivalent redundant states. However, if a SET occurring in one cycle is followed by an SEU in the next cycle, they might cause a double fault that puts the sequential circuit in a different, non-equivalent state.

2.3.2. Minimum state redundancy encoding for fault tolerance

The state assignment problem of an FSM can be viewed as a coding problem or as a partitioning problem [26]. The coding problem requires each state to be assigned a unique binary pattern. Determining the minimum number of bits required in state encoding after adding the redundant states needs further investigation.

In order to specify the number of extra bits needed to encode all states (original and redundant), we compute it based on the total number of states. Let $m$ be the number of normal states in an FSM. Let $n$ be the number of protected states, and $a$ be the number of bits needed to encode the protected states. Therefore, $a = \lceil \log_2(n) \rceil$. We need to compute the number of extra bits, $b$, needed to encode all the states.

The total number of bits used in encoding is $a + b$, which can be used to encode $2^{a+b}$ states. Each protected state will have $a + b$ redundant states, since all single-bit errors are considered (i.e., all the distance-1 encodings are redundant to each protected state). Therefore, the total number of states will be $n(1 + a + b) + m$. The condition for the number of encoding bits needed is that the number of states must be less than or equal to the number of possible encodings. Therefore, we get $2^{a+b} \ge n(1 + a + b) + m$. Let us assume that $m = 0$. Then, we need to find $b$ such that:

$$2^{a+b} \ge n(1 + a + b)$$

This inequality serves as a lower bound on the number of bits needed for protecting a given number of states.
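The lower bound can be found by incrementing $b$ until the inequality holds. The following is a minimal sketch; the function name is hypothetical, and starting the search at $b = 1$ follows the convention used in Table 3.

```python
def lower_bound_bits(n: int, m: int = 0) -> int:
    """Smallest a + b with 2**(a + b) >= n * (1 + a + b) + m, where a = ceil(log2(n))."""
    a = (n - 1).bit_length()   # exact integer ceil(log2(n)) for n >= 1
    b = 1
    while 2 ** (a + b) < n * (1 + a + b) + m:
        b += 1
    return a + b


for n in (2, 3, 4):
    print(n, lower_bound_bits(n))
```

For the bit flipper with two protected states and no normal states, this yields 3 bits, which matches the encoding shown in Fig. 3.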

To ensure that the redundant states do not overlap, the minimum Hamming distance between the protected states must be at least 3. This is required to make sure that the distance between any two different redundant states is greater than 0. It must also be ensured that the Hamming distance between every protected state and every normal state is at least 2.
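The two distance constraints can be expressed as a small validity check on a candidate encoding. This is a sketch only; `hamming` and `encoding_is_valid` are hypothetical helper names, and codes are represented as Python integers.

```python
from itertools import combinations


def hamming(x: int, y: int) -> int:
    """Hamming distance between two codes given as integers."""
    return bin(x ^ y).count("1")


def encoding_is_valid(protected, normal) -> bool:
    """Check the distance constraints stated in the text.

    Protected codes must be pairwise at distance >= 3, and every
    normal code must be at distance >= 2 from every protected code.
    """
    if any(hamming(p, q) < 3 for p, q in combinations(protected, 2)):
        return False
    return all(hamming(p, s) >= 2 for p in protected for s in normal)


print(encoding_is_valid([0b000, 0b111], []))          # bit flipper encoding
print(encoding_is_valid([0b0000, 0b0111], [0b0001]))  # normal code too close to 0000
```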

The upper bound is $2^{a+a+a} = 2^{3a}$, because the Hamming distance between the codes of states in the original FSM is at least 1 and triplicating their codes guarantees a Hamming distance of at least 3. If we combine both the lower and upper bounds, the resulting inequality is:

$$2^{3a} \ge 2^{a+b} \ge n(1 + a + b)$$

or

$$3a \ge a + b \ge \log_2\big(n(1 + a + b)\big)$$

Fig. 4 plots the lower and upper bounds as well as the actual minimum number of bits needed to encode a given number of protected states. The X-axis represents the number of protected states and the Y-axis represents the number of bits used. As illustrated in the figure, the lower bound is very tight. Therefore, in order to search for the minimum number of bits needed to encode the protected states, one can start from the lower bound and increase the number of bits until the Hamming distance constraint is satisfied.

2.3.3. Proposed minimum state redundancy encoding algorithm for fault tolerance

Since we do not have an exact formula for the minimum number of bits needed to encode states in a protected FSM, our proposed fault tolerance algorithm, shown in Table 3, uses exhaustive search to find codes for protected, redundant and normal states. Although exhaustive search methods consume a lot of time to find the solution, in our case this is acceptable since the idea is based on protecting a few highly probable states.

The proposed fault tolerance algorithm can be divided into a number of key steps, namely: computing the lower bound, finding the protected state codes, checking whether the normal states are covered by the remaining codes based on the current number of bits used for encoding, getting the redundant state codes, and finally getting the normal state codes.

In Table 3, the inputs are the number of protected states, $n$, and the number of normal states, $m$. After finding $a$ and $b$, we initialize empty sets for the outputs. Then, we find the lower bound based on the formula given in Section 2.3.2. Iterations, starting from the lower bound and continuing until the protected states are covered, search for codes with a Hamming distance (HD) of 3 between them. In each iteration, the current codes are used to generate additional codes with pairwise Hamming distance 3. If the resulting codes do not cover the needed number of protected state codes, the resulting codes are padded with "0" from the left and the number of bits used so far, $b$, is incremented by 1.

The core idea of the algorithm is how to find codes with pairwise Hamming distance 3. Based on the current codes for the protected states and the number of bits currently used to encode them, the algorithm runs through all the codes that can be formed with the current number of bits, $b$. For each candidate code, if the Hamming distance between the candidate and all the existing protected state codes is 3 or more, the candidate is added to the protected state codes. Otherwise, the next code is investigated, and so on.
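One way to read the search described above is as a greedy scan over all codes of the current width, widening by one bit whenever the scan does not yield enough codes. The sketch below follows that reading; it is an assumption-laden illustration rather than the authors' implementation, and the function names are hypothetical. Codes are integers, so left-padding with "0" is implicit.

```python
def hamming(x: int, y: int) -> int:
    """Hamming distance between two codes given as integers."""
    return bin(x ^ y).count("1")


def find_protected_codes(n: int, start_bits: int):
    """Greedily collect n codes with pairwise Hamming distance >= 3.

    Starts from the all-zero code at `start_bits` bits and adds one bit
    whenever a full scan of the current code space is not sufficient.
    """
    bits = start_bits
    codes = [0]                      # first protected code is 0...00
    while len(codes) < n:
        for candidate in range(1, 2 ** bits):
            if len(codes) == n:
                break
            if all(hamming(candidate, c) >= 3 for c in codes):
                codes.append(candidate)
        if len(codes) < n:
            bits += 1                # existing codes gain a leading 0
    return codes, bits


codes, bits = find_protected_codes(3, start_bits=4)
print([format(c, f"0{bits}b") for c in codes], bits)
```

Starting from the lower bound of 4 bits for three protected states, the scan finds only two codes at that width and succeeds after widening to 5 bits.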

Based on the obtained number of bits used to encode the protected states, the remaining codes are investigated to check whether they cover the required normal states. The remaining codes at this point are the possible codes obtained with the current number of bits, minus the protected states and their redundant states. This amount is $2^{a+b} - n(1 + a + b)$, where $a + b$ is the number of bits used.

The current number of bits used in the encoding, $a + b$, is incremented by $c$ bits to cover all the normal states, $m$. The number of extra bits needed to cover all normal states, $c$, must be such that all the possible codes, $2^{a+b+c}$, cover the normal states, $m$, the protected states, $n$, and their redundant states, $n(a + b + c)$. By solving for $c$ in the following inequality:

$$2^{a+b+c} \ge m + n(a + b + c + 1),$$

the additional number of bits needed to cover all the normal states can be obtained.
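Solving the inequality for $c$ can be done by simple iteration. In the sketch below, `bits` stands for the current width $a + b$; the function name is hypothetical.

```python
def extra_bits_for_normal_states(n: int, m: int, bits: int) -> int:
    """Smallest c with 2**(bits + c) >= m + n * (bits + c + 1).

    n: protected states, m: normal states, bits: current width a + b.
    """
    c = 0
    while 2 ** (bits + c) < m + n * (bits + c + 1):
        c += 1
    return c


print(extra_bits_for_normal_states(n=2, m=0, bits=3))  # bit flipper, no normal states
print(extra_bits_for_normal_states(n=2, m=1, bits=3))  # one normal state added
```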

An alternative way to find the additional number of bits needed to cover all the normal states, $c$, is to increment the current number of bits used to encode the protected states and their redundant states, $a + b$, by 1 bit and evaluate the amount $2^{a+b} - n(1 + a + b)$ again, iteratively, until the normal states are covered.

After that, the redundant state codes are computed for each protected state. They are easily found by flipping each bit of the protected state code to get 1 redundant state code at a time. Then, for each normal state, one of the remaining codes is assigned to it. It is required that each normal state has a Hamming distance of at least 2 from every protected state. This condition is clearly satisfied since all the distance-1 codes from the protected state codes are assigned to the redundant states.

Fig. 4. Lower bound, actual minimum number of bits used, and upper bound.

Table 3. Proposed minimum state redundancy encoding algorithm for fault tolerance.

Inputs
    Get the number of protected states, n, and the number of normal states, m, as inputs

Lower bound
    Let a = ceil(log2(n)) and let b = 1
    Compute the lower bound by incrementing b iteratively until a + b >= ceil(log2(n(1 + a + b)))

Protected states codes
    Let the current number of bits used, b, equal the lower bound
    Let the first protected state code be 0...00 with a length of b bits
    Let the size of protected state codes, s = 1
    While (s < n) {
        i = 1
        do {
            hd3 = 1;
            for (j = 1 to j <= s)
                if (HD(code i, code j) < 3) hd3 = 0
            if (hd3 == 1)
                add code i to protected state codes
                increment s by 1
            increment i by 1
        } while (i <= 2^b and s < n)
        if (s < n)
            increment b by 1
            pad protected state codes by 0 from left
    }

Find c
    Find the extra number of bits needed to encode all the normal states, c
    Increment b by c
    Pad protected state codes by a string of c 0's from left

Redundant states codes
    For each protected state, assign all Hamming-distance-1 codes as its redundant state codes

Normal states codes
    Codes that are not used for protected state codes or redundant state codes are assigned to normal state codes
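The last two steps, deriving redundant codes by single-bit flips and handing the leftover codes to normal states, can be sketched as follows. The function names and the 4-bit example are assumptions for illustration.

```python
def redundant_codes(code: int, bits: int) -> list:
    """All distance-1 neighbours of a protected code (one per flipped bit)."""
    return [code ^ (1 << i) for i in range(bits)]


def assign_codes(protected: list, m: int, bits: int):
    """Return (redundant codes per protected code, codes for m normal states)."""
    used = set(protected)
    redundant = {}
    for p in protected:
        redundant[p] = redundant_codes(p, bits)
        used.update(redundant[p])
    # Any code not used so far is at distance >= 2 from every protected code.
    free = [c for c in range(2 ** bits) if c not in used]
    if len(free) < m:
        raise ValueError("not enough codes left; more bits are needed")
    return redundant, free[:m]


redundant, normal = assign_codes([0b0000, 0b0111], m=2, bits=4)
print({format(p, "04b"): [format(r, "04b") for r in rs] for p, rs in redundant.items()})
print([format(c, "04b") for c in normal])
```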

2.3.4. Complexity of proposed fault tolerance algorithm

The complexity of the proposed fault tolerance algorithm is mainly controlled by the method applied to find the Hamming distance 3 codes for the protected states. As it is required that the codes in the set of protected state codes have pairwise Hamming distance 3, the search for such codes goes through 3 nested loops. The outer loop costs roughly $(b - \text{lower bound})$ iterations, where $b$ is the number of bits needed to encode the protected state codes with pairwise Hamming distance 3. However, since this is a small number of iterations given that the lower bound is tight, it can be neglected. The second loop has a complexity of $O(2^b)$ and the third loop has a complexity of $O(n)$, where $n$ is the number of protected state codes. Therefore, the complexity of the proposed fault tolerance algorithm is $O(n \cdot 2^b)$. This complexity is considered acceptable since the number of protected states is often small.

3. Simulation environment and framework

To demonstrate the effectiveness of our proposed fault tolerance algorithm, a simulation-based reliability model based on Monte-Carlo simulation, like the one used in [27], is adopted. In this section, we describe the simulation framework used for reliability evaluation, the assumptions made, and the fault model and fault injection mechanism used.


3.1. Failure rate computation

The procedure for computing the failure rate of a sequential circuit for a given number of faults F is as follows:

• Set the number of failed iterations, k, to 0.
• For each iteration i from 1 to S, where S is the chosen number of iterations:
  – Generate a random input sequence of C clock cycles, where the length of each entry in the random sequence is the same as the number of inputs in the circuit.
  – Simulate the circuit to get the fault-free original output by applying the generated random sequence, and store the output sequence in $O$, where the length of each output in $O$ is the same as the number of outputs in the circuit.
  – Inject F random faults in the circuit in cycle number $\lceil C/2 \rceil$.
  – Simulate the faulty circuit with the same random sequence to get the faulty output sequence and store it in $O_F$.
  – If the output sequence $O$ differs from $O_F$ in any cycle, then k is incremented.
• Calculate the failure rate for F faults as $R_F = k/S$.
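The procedure above can be sketched in a few lines. This is an illustration under loud assumptions: `toy_counter` is a hypothetical 2-bit enable counter standing in for a benchmark circuit, and a fault is modeled as forcing one state bit to 0 or 1 during the injection cycle, which is a simplification of gate-level stuck-at injection.

```python
import math
import random

def toy_counter(seq, inject_cycle=None, faults=()):
    """Hypothetical 2-bit counter with one enable input and one output."""
    state = 0  # the simulator starts from the all-0 state
    outputs = []
    for cycle, (enable,) in enumerate(seq, start=1):
        if cycle == inject_cycle:
            for bit, value in faults:  # force the chosen state bit to `value`
                state = state | (1 << bit) if value else state & ~(1 << bit)
        outputs.append(int(state == 3))
        state = (state + enable) % 4
    return outputs

def failure_rate(simulate, num_inputs, fault_sites, F, S=3000, C=129, seed=0):
    """Monte-Carlo estimate R_F = k / S for F injected faults."""
    rng = random.Random(seed)
    k = 0
    inject_cycle = math.ceil(C / 2)  # cycles are numbered from 1
    for _ in range(S):
        seq = [tuple(rng.randint(0, 1) for _ in range(num_inputs))
               for _ in range(C)]
        golden = simulate(seq)                       # fault-free output O
        faults = [(rng.choice(fault_sites), rng.randint(0, 1))
                  for _ in range(F)]
        faulty = simulate(seq, inject_cycle, faults)  # faulty output O_F
        if golden != faulty:
            k += 1
    return k / S

rate = failure_rate(toy_counter, num_inputs=1, fault_sites=[0, 1], F=1, S=200)
print(rate)
```

A real evaluation would replace `toy_counter` with a gate-level simulator of the benchmark circuit and draw fault sites from its gates and flip-flops.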

3.2. Assumptions

A set of assumptions has been made in our analysis, as follows:

• Since the number of soft errors grows in direct proportion to the area of a design [3], we have injected 1 and 5 faults in the original circuit and their corresponding faults in other circuits. For example, if the area of a circuit is 2.4 times that of the original circuit, then 1 fault in the original circuit corresponds to 2.4 faults in that circuit. The failure rate for 2.4 faults is computed as 0.6 × (failure rate for 2 faults) + 0.4 × (failure rate for 3 faults).
• The number of iterations, S, is set to 3000.
• The number of cycles, C, is set to 129. Hence the faults are injected in cycle number 65.
• Faults are injected randomly at the gate level of the circuit.
• Since the proposed technique affects the protection of faults occurring in FFs or their logic, the faults are injected in the combinational logic part that represents the D-type flip-flop equations or in the flip-flops themselves. In other words, the combinational logic part corresponding to the output equations is excluded.
• Only logic masking is considered. Electrical masking and latching window masking are not considered.
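The weighting used for a fractional fault count, as in the 2.4-fault example above, amounts to linear interpolation between the two neighbouring integer fault counts. A minimal sketch follows; the function name and the failure rate values in `measured` are hypothetical and only illustrate the arithmetic.

```python
import math

def interpolated_failure_rate(equivalent_faults, rate_for):
    """Failure rate for a fractional fault count.

    `rate_for(f)` is assumed to return the measured failure rate for an
    integer number of faults f.
    """
    low = math.floor(equivalent_faults)
    frac = equivalent_faults - low
    if frac == 0:
        return rate_for(low)
    return (1 - frac) * rate_for(low) + frac * rate_for(low + 1)

# Hypothetical measured rates for 2 and 3 faults.
measured = {2: 0.10, 3: 0.20}
result = interpolated_failure_rate(2.4, measured.get)
print(round(result, 3))
```

With these illustrative inputs the result is 0.6 × 0.10 + 0.4 × 0.20.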

3.3. Fault model and fault injection mechanism

In our work, we assume the stuck-at fault model. A fault injected at any gate can be either stuck-at-1 (i.e. connected to Vdd) or stuck-at-0 (i.e. connected to ground). Each simulation iteration is run for C cycles. Multiple stuck-at faults are injected simultaneously in random locations during one cycle (i.e. the middle one). The simulator starts from the all-0 state.

Fig. 5. Injection traces.

By injecting faults in the middle cycle, the FSM trace of states can be divided into 3 parts, as shown in Fig. 5: the Pre-injection trace, during which the FSM is initialized to a certain state; the fault injection trace, which is the single cycle in which the faults are injected; and the Post-injection trace, during which the circuit is tested for fault effect propagation. According to [28], 10 cycles are sufficient for the observation of a fault effect in an FSM. In our work, we used 129 cycles: 64 cycles for the Pre-injection trace, one cycle for fault injection, and 64 cycles for the Post-injection trace.
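The split of the trace follows directly from the choice of the middle cycle. As a small sketch (the helper name `split_trace` is hypothetical, and cycles are assumed to be numbered from 1):

```python
import math

def split_trace(C):
    """Return (pre-injection length, injection cycle, post-injection length)."""
    inject_cycle = math.ceil(C / 2)
    pre = inject_cycle - 1    # cycles before the injection cycle
    post = C - inject_cycle   # cycles after the injection cycle
    return pre, inject_cycle, post

print(split_trace(129))
```

For C = 129 this gives 64 cycles before the injection, injection in cycle 65, and 64 cycles after it.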

4. Experimental results

In this section, we will discuss the experiments performed to demonstrate the effectiveness of the proposed fault tolerance technique. A comparison with three other fault tolerance techniques is performed and the results are discussed.

4.1. Experiments

Several experiments have been performed for the 11 MCNC/LGSynth sequential circuits [25] shown in Table 1. For all the circuits implemented, no logic sharing is allowed between the logic cones of memory elements and outputs, including inverters. For each circuit, several versions are implemented as follows:

• Original circuit without fault tolerance.
• A fault tolerant circuit protecting a number of states with 50% state probability coverage.
• A fault tolerant circuit protecting a number of states with 90% state probability coverage.
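To make the notion of state probability coverage concrete, the sketch below picks the smallest set of most probable states whose probabilities add up to a target coverage. The greedy, highest-probability-first selection and the probabilities in `state_prob` are assumptions for illustration, not data from the benchmark circuits.

```python
def states_for_coverage(state_prob, target):
    """Pick the most probable states until their total probability >= target."""
    chosen = []
    covered = 0.0
    ranked = sorted(state_prob.items(), key=lambda kv: kv[1], reverse=True)
    for state, prob in ranked:
        if covered >= target:
            break
        chosen.append(state)
        covered += prob
    return chosen

# Hypothetical steady-state probabilities of a 5-state FSM.
state_prob = {"s0": 0.55, "s1": 0.25, "s2": 0.12, "s3": 0.05, "s4": 0.03}

protect_50 = states_for_coverage(state_prob, 0.50)
protect_90 = states_for_coverage(state_prob, 0.90)
print(protect_50, protect_90)
```

In this illustrative FSM one state suffices for 50% coverage and three states for 90% coverage, which is the kind of skewed distribution for which protecting few states is attractive.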

In addition, for comparison with our proposed approach, three different fault tolerance techniques are implemented for each circuit as follows:

• Latch Hardening: it covers all the methods that involve protection of faults occurring at the memory elements. We assumed full protection for all faults occurring at the memory elements.
• Triple Modular Redundancy with Triple Voters (TMRTV): based on the approach proposed in [11], applying TMR on the memory elements along with their combinational logic. Each memory element is triplicated along with its combinational logic, and three voters are added for each memory element. Each voter feeds one of the triple modules to ensure tolerance of faults occurring also in the voters. The output logic is fed by one of the voters.
• One-Hot Encoding with Duplicate (OneHD): based on the approach in [10], where the FSM is encoded using one-hot encoding and then its next state logic and flip-flops are duplicated. Two sets of select logic and error check logic are used to select the correct set of flip-flop outputs for each machine duplicate. The output logic is fed by the outputs of one of the select logic sets.
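The voting principle behind TMR can be shown with a one-bit majority function. This sketch only illustrates how a 2-out-of-3 vote masks a single faulty copy; it does not model the per-module voter wiring of TMRTV, and the helper names are hypothetical.

```python
from itertools import product

def majority(a, b, c):
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)

def vote_with_single_fault(value, faulty_copy):
    """Flip one of three copies of `value`, then take the majority."""
    copies = [value, value, value]
    copies[faulty_copy] ^= 1
    return majority(*copies)

# A single faulty copy is masked for every value and every fault position.
masked = all(vote_with_single_fault(v, i) == v
             for v, i in product((0, 1), range(3)))
print(masked)
```

Two simultaneous faulty copies are not masked by a single vote, which is one reason the number of injected faults matters in the comparison.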

4.2. Area overhead estimation

Techniques involving adding redundancy to enhance reliability often pay the price in terms of area. We calculate the size of sequential circuits relative to the size of a single inverter. Table 4 shows the library used to estimate the area of circuits. For latch hardening techniques, we assume that the size of each latch is 2.5 times the size of an original latch. This assumption is based on the design given in [6].


Table 4
Size of gates used.

Gate          Size
NOT           1
NAND2         2
NAND3         3
NAND4         4
NOR2          2
NOR3          3
NOR4          4
D-FF          12
Harden D-FF   30
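Given the gate sizes of Table 4, the area of a circuit is a weighted sum of its gate counts. The sketch below uses the table's sizes; the gate counts in `netlist` are hypothetical and serve only to show the calculation.

```python
# Gate sizes relative to a single inverter, as listed in Table 4.
GATE_SIZE = {
    "NOT": 1, "NAND2": 2, "NAND3": 3, "NAND4": 4,
    "NOR2": 2, "NOR3": 3, "NOR4": 4,
    "D-FF": 12, "Harden D-FF": 30,
}

def circuit_area(gate_counts):
    """Total area as the sum of gate size times gate count."""
    return sum(GATE_SIZE[gate] * count for gate, count in gate_counts.items())

# Hypothetical gate counts for a small sequential circuit.
netlist = {"NOT": 5, "NAND2": 10, "NOR3": 2, "D-FF": 3}

# Same circuit with every flip-flop replaced by a hardened one.
hardened = {g: c for g, c in netlist.items() if g != "D-FF"}
hardened["Harden D-FF"] = netlist["D-FF"]

print(circuit_area(netlist), circuit_area(hardened))
```

The hardened flip-flop size of 30 is 2.5 times the plain D-FF size of 12, consistent with the latch hardening assumption stated above.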


4.3. Proposed fault tolerance algorithm results

Sequential circuits can be classified into two main classes based on their state probability coverage. The first class requires the protection of a small number of states (often less than 25% of the number of states) to obtain a minimum of 90% state probability coverage. The second class requires the protection of a large number of states (often more than 50% of the number of states) to obtain a minimum of 90% state probability coverage. Sequential circuits bbara, train11 and dk14 belong to the second class while all other circuits belong to the first class.

Results of failure rates for injecting 1 and 5 faults in the original circuits and their corresponding faults in other circuits for the 11 MCNC/LGSynth benchmark circuits are shown in Tables 5 and 6, respectively. For each circuit, the failure rates of the original circuit, 50% and 90% state probability coverage, triple modular redundancy with three voters for the flip-flops along with their combinational logic (TMRTV), latch hardening and one-hot encoding with duplicate (OneHD) are reported.

Table 5
Failure rate results for MCNC/LGSynth sequential benchmark circuits – Injecting 1 fault in original circuit and corresponding faults in other circuits.

Circuits   Original     State prob. coverage   Comparison
           non-shared   50%       90%          Harden    TMRTV     OneHD
bbara      0.051        0.045     0.02         0.039     0.015     0.058
bbsse      0.084        0.028     0.006        0.066     0.016     0.026
cse        0.027        0.009     0.005        0.021     0.01      0.023
keyb       0.025        0.009     0.007        0.021     0.01      0.014
lion9      0.064        0.034     0.01         0.069     0.051     0.119
pma        0.079        0.026     0.01         0.075     0.02      0.038
s832       0.046        0.02      0.006        0.04      0.016     0.069
s1494      0.067        0.014     0.008        0.055     0.015     0.079
styr       0.054        0.017     0.009        0.047     0.011     0.065
train11    0.111        0.042     0.028        0.118     0.036     0.075
dk14       0.105        0.051     0.028        0.089     0.02      0.115

Table 6
Failure rate results for MCNC/LGSynth sequential benchmark circuits – Injecting 5 faults in original circuit and corresponding faults in other circuits.

Circuits   Original     State prob. coverage   Comparison
           non-shared   50%       90%          Harden    TMRTV     OneHD
bbara      0.185        0.179     0.122        0.161     0.131     0.424
bbsse      0.273        0.125     0.069        0.261     0.134     0.184
cse        0.126        0.053     0.04         0.096     0.072     0.201
keyb       0.118        0.056     0.037        0.095     0.071     0.121
lion9      0.314        0.168     0.106        0.304     0.344     0.536
pma        0.334        0.136     0.088        0.294     0.148     0.291
s832       0.196        0.11      0.04         0.164     0.098     0.372
s1494      0.246        0.077     0.054        0.226     0.097     0.444
styr       0.235        0.102     0.047        0.196     0.078     0.39
train11    0.378        0.24      0.228        0.409     0.314     0.543
dk14       0.354        0.282     0.21         0.342     0.193     0.566

Table 7
Area overhead for MCNC/LGSynth sequential benchmark circuits.

Circuits   Original     State prob. coverage   Comparison
           non-shared   50%       90%          Harden    TMRTV     OneHD
bbara      197          427       742          269       647       806
bbsse      419          645       704          491       1083      1065
cse        581          818       942          653       1410      1781
keyb       763          1092      1289         853       2314      1994
lion9      139          229       355          211       511       586
pma        723          1144      1551         813       1937      2353
s832       790          1040      1440         880       1990      2742
s1494      1479         2099      2351         1587      2934      5106
styr       1329         1812      2291         1419      2980      4260
train11    206          350       737          278       709       941
dk14       299          660       889          353       698       731

It is evident from the results that our proposed approach with 90% state probability coverage protection achieves the lowest failure rates for most of the circuits among all compared techniques. This is followed by TMRTV, 50% state probability coverage, latch hardening and finally one-hot encoding with duplicate.

For faults corresponding to 1 fault in the original circuit, our approach with 90% state probability coverage achieves lower failure rates than TMRTV for 9 out of the 11 compared circuits and lower failure rates than latch hardening and OneHD for all compared circuits. In addition, our approach with 50% state probability coverage achieves lower failure rates than latch hardening and OneHD for 10 out of the 11 compared circuits.

For faults corresponding to 5 faults in the original circuit, our approach with 90% state probability coverage achieves lower failure rates than TMRTV for 10 out of the 11 compared circuits and lower failure rates than latch hardening and OneHD for all compared circuits. In addition, our approach with 50% state probability coverage achieves lower failure rates than TMRTV for 7 out of the 11 compared circuits, lower failure rates than latch hardening for 10 out of the 11 circuits and lower failure rates than OneHD for all the compared circuits.
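The circuit counts quoted above can be re-derived from the 90% coverage and TMRTV columns of Tables 5 and 6. The sketch below copies those two columns from the tables; the variable and function names are illustrative.

```python
CIRCUITS = ["bbara", "bbsse", "cse", "keyb", "lion9", "pma",
            "s832", "s1494", "styr", "train11", "dk14"]

# Failure rates for 1 injected fault (Table 5).
COV90_1 = [0.02, 0.006, 0.005, 0.007, 0.01, 0.01,
           0.006, 0.008, 0.009, 0.028, 0.028]
TMRTV_1 = [0.015, 0.016, 0.01, 0.01, 0.051, 0.02,
           0.016, 0.015, 0.011, 0.036, 0.02]

# Failure rates for 5 injected faults (Table 6).
COV90_5 = [0.122, 0.069, 0.04, 0.037, 0.106, 0.088,
           0.04, 0.054, 0.047, 0.228, 0.21]
TMRTV_5 = [0.131, 0.134, 0.072, 0.071, 0.344, 0.148,
           0.098, 0.097, 0.078, 0.314, 0.193]

def losers(ours, theirs):
    """Circuits where `ours` does NOT achieve a strictly lower failure rate."""
    return [c for c, a, b in zip(CIRCUITS, ours, theirs) if not a < b]

lost_1 = losers(COV90_1, TMRTV_1)
lost_5 = losers(COV90_5, TMRTV_5)
wins_1 = len(CIRCUITS) - len(lost_1)
wins_5 = len(CIRCUITS) - len(lost_5)
print(wins_1, lost_1, wins_5, lost_5)
```

The exceptions are second-class circuits, where many states must be protected to reach 90% coverage.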

The one-hot encoding with duplicate fault tolerance technique achieves higher failure rates compared to other techniques due to its relatively large area overhead. A larger area overhead implies more injected faults, which increases the failure rate. The increase in the number of faults injected has an exponential relation with the increase in the failure rate.

The area for each experiment in each circuit is shown in Table 7. The largest average area overhead is 236%, for OneHD. This is followed by 175% for TMRTV, 129% for 90% state probability coverage, 62% for 50% state probability coverage, and 20% for latch hardening.
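The average overheads can be reproduced from Table 7. The sketch below assumes that the average is taken as the mean of the per-circuit area ratios; this averaging convention is an assumption, chosen because it appears to reproduce the percentages quoted above, and the names are illustrative.

```python
# Table 7 rows: (original, 50%, 90%, Harden, TMRTV, OneHD)
AREA = {
    "bbara":   (197, 427, 742, 269, 647, 806),
    "bbsse":   (419, 645, 704, 491, 1083, 1065),
    "cse":     (581, 818, 942, 653, 1410, 1781),
    "keyb":    (763, 1092, 1289, 853, 2314, 1994),
    "lion9":   (139, 229, 355, 211, 511, 586),
    "pma":     (723, 1144, 1551, 813, 1937, 2353),
    "s832":    (790, 1040, 1440, 880, 1990, 2742),
    "s1494":   (1479, 2099, 2351, 1587, 2934, 5106),
    "styr":    (1329, 1812, 2291, 1419, 2980, 4260),
    "train11": (206, 350, 737, 278, 709, 941),
    "dk14":    (299, 660, 889, 353, 698, 731),
}
COLUMNS = ("50%", "90%", "Harden", "TMRTV", "OneHD")

def mean_overhead_percent(column):
    """Mean over circuits of (technique area / original area), minus 1, in %."""
    idx = COLUMNS.index(column) + 1
    ratios = [row[idx] / row[0] for row in AREA.values()]
    return 100.0 * (sum(ratios) / len(ratios) - 1.0)

overhead = {name: mean_overhead_percent(name) for name in COLUMNS}
for name in COLUMNS:
    print(name, round(overhead[name]))
```

Dividing the summed areas instead of averaging the ratios would give different figures, so the convention matters when comparing against the reported percentages.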

From these results, it is concluded that protecting states yielding 90% state probability coverage increases the area to about 2.3 times that of the original circuit. On the other hand, TMRTV increases the area to 2.75 times that of the original circuit, with a higher failure rate. It is worth mentioning that for sequential circuits belonging to the first class, the proposed approach with 90% state probability coverage achieves significant reductions in both failure rate and area overhead compared to TMRTV. For circuits bbsse, cse, keyb, s832, s1494 and styr, the area for 90% state probability coverage is about 1.7 times that of the original circuit.

5. Conclusion

In this work, we have proposed a finite state machine based fault tolerance technique for sequential circuits. The technique enhances sequential circuit fault tolerance by introducing redundant equivalent states to the states with high probability of occurrence. By protecting few states with high probability of occurrence, the area overhead is minimized. The proposed technique requires that the Hamming distance between protected state codes be at least 3 to guarantee no overlapping between the redundant state codes.

The experimental results demonstrate the effectiveness of the proposed technique in enhancing sequential circuit fault tolerance while minimizing the area overhead. In comparison with triple-modular redundancy with triple voters (TMRTV), one-hot encoding with duplicate (OneHD) and flip-flop hardening techniques, protecting states with 90% probability of occurrence achieves lower failure rates for most compared circuits. The area for protecting states with 90% probability of occurrence is on average about 2.3 times the original circuit size, in comparison to 2.75 for TMRTV and 3.36 for OneHD. The proposed technique is very effective for a special class of sequential circuits with few states covering over 90% probability of occurrence, as it achieves significant failure rate reduction with a minimal area of around 1.7 times the original circuit size.

Acknowledgment

This work is supported by King Fahd University of Petroleum & Minerals under Project# SB100006.

References

[1] Moore GE. Cramming more components onto integrated circuits. Electronics 1965;38:114–7.

[2] Butts M, DeHon A, Goldstein SC. Molecular electronics: devices, systems and tools for gigagate, gigabit chips. In: Proceedings of the IEEE/ACM international conference on computer-aided design. ACM; 2002. p. 433–40.

[3] Dixit A, Wood A. The impact of new technology on soft error rates. In: Proceedings of the international reliability physics symposium; 2011. p. 5B.4.1–7.

[4] Mavis P, Eaton D. SEU and SET mitigation techniques for FPGA circuit and configuration bit storage design. In: Proceedings of the international conference on military and aerospace applications of programmable logic devices, MAPLD, September 2000.

[5] Mitra S, Seifert N, Zhang M, Shi Q, Kim K. Robust system design with built-in soft-error resilience. Computer 2005;38:43–52.

[6] Krishnamohan S, Mahapatra NR. Analysis and design of soft-error hardened latches. In: Proceedings of the 15th ACM great lakes symposium on VLSI; 2005. p. 328–31.

[7] Komatsu Y, Arima Y, Fujimoto T, Yamashita T, Ishibashi K. A soft-error hardened latch scheme for SoC in a 90 nm technology and beyond. In: Proceedings of the IEEE custom integrated circuits conference; 2004. p. 329–32.

[8] Calin T, Nicolaidis M, Velazco R. Upset hardened memory design for submicron CMOS technology. IEEE Trans Nucl Sci 1996;43:2874–8.

[9] Hazucha P, Karnik T, Walstra S, Bloechel B, Tschanz J, Maiz J, et al. Measurements and analysis of SER-tolerant latch in a 90-nm dual-VT CMOS process. IEEE J Solid-State Circuits 2004;39:1536–43.

[10] Cassel M, Kastensmidt FL. Evaluating one-hot encoding finite state machines for SEU reliability in SRAM-based FPGAs. In: Proceedings of the 12th IEEE international on-line testing symposium; 2006. p. 139–44.

[11] Hadjicostis CN, Verghese GC. Coding approaches to fault tolerance in linear dynamic systems. IEEE Trans Inform Theory 2005;51(1):210–28.

[12] Larsen RW, Reed IS. Redundancy by coding versus redundancy by replication for failure-tolerant sequential circuits. IEEE Trans Comput 1972;C21(2):130–7.

[13] Levin I, Ostrovsky V, Ostanin S. Self-healing ability of sequential circuits. In: Proceedings of the 22nd convention of electrical and electronics engineers in Israel, December 2002. p. 114–6.

[14] Huang ZF, Liang HG. The impact of MBUs on the reliability of rollback recovery circuits. In: Proceedings of the IEEE circuits and systems international conference on testing and diagnosis; 2009. p. 1–4.

[15] Parker K, McCluskey E. Probabilistic treatment of general combinational networks. IEEE Trans Comput 1975;C-24:668–70.

[16] Hachtel G, Macii E, Pardo A, Somenzi F. Probabilistic analysis of large finite state machines. In: Proceedings of the 31st conference on design automation; 1994. p. 270–5.

[17] Savir J, Ditlow GS, Bardell PH. Random pattern testability. IEEE Trans Comput 1984;C-33:79–90.

[18] Nicolaidis M. On-line testing for VLSI: state of the art and trends. Integration, The VLSI J 1998;26(1–2):197–209.

[19] Marculescu D, Marculescu R, Pedram M. Trace-driven steady-state probability estimation in FSMs with application to power estimation. In: Proceedings of the design automation and test in Europe conference; 1998. p. 774–9.

[20] Hachtel G, Hermida M, Pardo A, Poncino M, Somenzi F. Re-encoding sequential circuits to reduce power dissipation. In: Proceedings of the IEEE/ACM international conference on computer-aided design; 1994. p. 70–3.

[21] El-Maleh A, Sait S, Nawaz Khan F. Finite state machine state assignment for area and power minimization. In: Proceedings of the IEEE international symposium on circuits and systems, May 2006. p. 5303–6.

[22] Hachtel G, Macii E, Pardo A, Somenzi F. Markovian analysis of large finite state machines. IEEE Trans Comput-Aided Des Integ Circuits Syst 1996;15:1479–93.

[23] Khan FN. Finite state machine encoding/state assignment for low power, reduced area and increased testability using iterative algorithms. KFUPM: M.Sc.; 2005.

[24] Papoulis A. Random variables and stochastic processes. McGraw-Hill; 1984.

[25] http://www.cbl.ncsu.edu:16080/benchmarks/LGSynth89/fsmexamples.

[26] Kohavi Z. Switching and finite automata theory. second ed. McGraw-Hill; 1978.

[27] Han J, Gao J, Qi Y, Jonker P, Fortes JAB. Toward hardware-redundant fault-tolerant logic for nanoelectronics. IEEE Des Test Comput 2005;22(4):328–39.

[28] Miskov-Zivanov N, Marculescu D. Soft error rate analysis for sequential circuits. In: Proceedings of the design automation and test in Europe conference; 2007. p. 1436–41.