Fault Tolerance Exam


  • 8/13/2019 Fault Tolerance Exam


    ECE 753 - FAULT-TOLERANT COMPUTING (Spring 2010-2011)

    Examination

    CLOSED BOOK

    Kewal K. Saluja

    Date: April 13, 2011

    Location: Room 3024 Engineering Hall
    Time: 7:15 PM
    Duration: 90 Minutes

    No  PROBLEM        POINTS  SCORE
    1   General          11
    2   Testing          13
    3   Reliability       8
    4   Reliability      14
    5   System           13
    6   ECC              18
    7   Cyclic Codes     15
    8   Checkpointing     8
        TOTAL           100

    Show your work carefully for both full and partial credit.

    You will be given credit only for what appears on your exam.

    Last Name (Please print): SOLUTION

    First Name:

    ID Number:


    ECE 753: Fault-Tolerant Computing

    1. (11 points) General questions

    (a) (2 points) Define the term fault secure.

    Answer: The circuit either continues to give correct results in the presence of a fault, or it produces a non-code word for valid code-word inputs.

    (b) (2 points) Why is self-testing an important property of self-checking circuits?

    Answer: This property assures that a circuit can be tested using only valid code words as inputs. Thus, it provides the ability to test a circuit on-line (during normal operation) without introducing extraneous non-code words.

    (c) (2 points) Define the term checkpointing latency.

    Answer: The total time needed to save the checkpoint information to stable storage.

    (d) (1 point) What is an orphan message?

    Answer: A message that has been received but has not been sent, i.e., the message does not yet have a parent (sender).

    (e) (2 points) What is the difference between reliability and availability?

    Answer: Reliability does not allow going back to an operational state once the system fails, whereas availability allows such transitions in the system state.

    (f) (2 points) Name two methods that can reduce the number of signatures in the context of the watchdog technique.

    Answer: Branch address hashing; path signatures.

    2 Spring 2010-11 (LEC: Saluja)


    2. (13 points) Testing and test generation

    Consider the combinational circuit of Figure 1.

    [Figure 1: A circuit for testing. Labels in the figure: input variables A, B, C, D; lines a, b, c, d, e; outputs Out1 and Out2.]

    (a) (9 points)

    Two faults, f1 and f2, in a circuit are said to be equivalent if the circuit function at each output in the presence of fault f1 is identical to the function at that output in the presence of fault f2.

    Answer the following. You must show your work.

    i. (3 points) Is the fault line c stuck-at 1 equivalent to the fault line e stuck-at 1?
    Answer: These two faults are equivalent. For both faults Out2 is always 1, and Out1 is not affected by either fault.

    ii. (3 points) Is the fault line a stuck-at 1 equivalent to the fault line b stuck-at 1?
    Answer: These two faults are not equivalent. The fault a stuck-at 1 does not affect the output Out2, whereas line b affects Out2 and is not redundant with respect to Out2.

    iii. (3 points) Is the fault line b stuck-at 0 equivalent to the fault line d stuck-at 0?
    Answer: These two faults are not equivalent. The output Out1 is not the same for these two faults, because line d stuck-at 0 does not affect Out1 whereas line b stuck-at 0 forces Out2 to 0.


    (b) (2 points) Based on the structure of the circuit, what is the lower bound on the maximum number of variables that would need to be assigned binary values to detect a fault on line b?

    Answer: Two, because assigning A and B will both excite the fault and sensitize it to Out1.

    (c) (2 points) Based on the structure of the circuit, what is the lower bound on the maximum number of variables that would need to be assigned binary values to detect a fault on line c?

    Answer: Four. This fault can be detected only at Out2, and hence we may have to assign all four variables, based only on structural information.


    3. (8 points) Reliability modeling (Reliability Block Diagram)

    The reliability equation of a system with 5 modules, A, B, C, D and E, is given below:

    \[ R_{system} = R_C \,[1 - (1 - R_A)(1 - R_B)]\,[1 - (1 - R_D)(1 - R_E)] + (1 - R_C)\,[1 - (1 - R_A R_D)(1 - R_B R_E)] \]

    Draw the reliability block diagram of the system based on the above expression.

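    Answer: See Figure 2.

    [Figure 2: Non-series-parallel system satisfying the reliability equation. Blocks A, B, C, D and E form a bridge network: A and B are the two input-side branches, D and E are the two output-side branches, and C bridges the A-D junction and the B-E junction, so the success paths are A-D, B-E, A-C-E and B-C-D.]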
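One way to check a candidate block diagram against the given expression is to enumerate all 32 up/down states of the five modules and add up the probability of every state that contains a working path. The sketch below assumes the bridge structure with success paths A-D, B-E, A-C-E and B-C-D; the names `PATHS`, `r_formula`, `r_enumerated` and the numeric reliabilities in `example` are illustrative choices, not part of the exam.

```python
from itertools import product

MODULES = "ABCDE"
# Assumed success paths of the bridge network (C is the bridge element).
PATHS = [{"A", "D"}, {"B", "E"}, {"A", "C", "E"}, {"B", "C", "D"}]


def r_formula(r):
    """System reliability from the exam's expression (conditioning on C)."""
    c_up = (1 - (1 - r["A"]) * (1 - r["B"])) * (1 - (1 - r["D"]) * (1 - r["E"]))
    c_down = 1 - (1 - r["A"] * r["D"]) * (1 - r["B"] * r["E"])
    return r["C"] * c_up + (1 - r["C"]) * c_down


def r_enumerated(r):
    """System reliability by summing the probability of every working state."""
    total = 0.0
    for state in product([True, False], repeat=len(MODULES)):
        up = {m for m, ok in zip(MODULES, state) if ok}
        if any(path <= up for path in PATHS):  # some path is fully working
            prob = 1.0
            for m, ok in zip(MODULES, state):
                prob *= r[m] if ok else 1 - r[m]
            total += prob
    return total


# Arbitrary illustrative module reliabilities.
example = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.95, "E": 0.85}
print(r_formula(example), r_enumerated(example))
```

If the two printed values agree for several choices of module reliabilities, the assumed diagram is consistent with the expression.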


    4. (14 points) Reliability modeling (Markov Model)

    A fault-tolerant system is to be modeled using a Markov model with three states, namely O, T and R.

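    The following set of differential equations is obtained from the Markov model. (The rate symbols are not legible in this copy; the three transition rates are written here as \lambda for O to T, \mu_1 for T to R, and \mu_2 for R to O.)

    \[ \frac{d}{dt} p_O(t) = -\lambda\, p_O(t) + \mu_2\, p_R(t) \]

    \[ \frac{d}{dt} p_T(t) = \lambda\, p_O(t) - \mu_1\, p_T(t) \]

    \[ \frac{d}{dt} p_R(t) = \mu_1\, p_T(t) - \mu_2\, p_R(t) \]

    (a) (3 points) Write the A matrix, i.e., the equations in matrix form.

    Answer:

    \[ \frac{d}{dt} \begin{bmatrix} p_O(t) \\ p_T(t) \\ p_R(t) \end{bmatrix} = \begin{bmatrix} -\lambda & 0 & \mu_2 \\ \lambda & -\mu_1 & 0 \\ 0 & \mu_1 & -\mu_2 \end{bmatrix} \begin{bmatrix} p_O(t) \\ p_T(t) \\ p_R(t) \end{bmatrix} \]

    (b) (5 points) Draw the Markov model of the system.

    Answer: See Figure 3.

    [Figure 3: Markov chain of a fault tolerant system. States O, T and R, with transitions O to T at rate \lambda, T to R at rate \mu_1, and R to O at rate \mu_2.]

    (c) (6 points) The reliability of the above system is obtained by solving the Markov model, and it is:

    \[ Rel(t) = \frac{\lambda}{\lambda - \mu_1} e^{-\mu_1 t} - \frac{\mu_1}{\lambda - \mu_1} e^{-\lambda t} \]

    What is the MTTF of the system? Use any method you like to solve for this.

    Answer: To obtain the MTTF of the system one can compute

    \[ MTTF = \int_0^{\infty} Rel(t)\, dt \]

    On integration and simplification we find

    \[ MTTF = \frac{\lambda}{(\lambda - \mu_1)\,\mu_1} - \frac{\mu_1}{(\lambda - \mu_1)\,\lambda} = \frac{\lambda + \mu_1}{\lambda\,\mu_1} = \frac{1}{\lambda} + \frac{1}{\mu_1} \]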
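The MTTF result can be sanity-checked numerically. The sketch below assumes the reliability has the two-exponential form Rel(t) = [lam * exp(-mu1 * t) - mu1 * exp(-lam * t)] / (lam - mu1), where `lam` and `mu1` are stand-in names for the O-to-T and T-to-R rates (the original symbols are not recoverable), and compares a trapezoidal integral of Rel(t) with 1/lam + 1/mu1. The rate values used are arbitrary.

```python
import math


def rel(t, lam, mu1):
    """Assumed reliability expression; requires lam != mu1."""
    return (lam * math.exp(-mu1 * t) - mu1 * math.exp(-lam * t)) / (lam - mu1)


def mttf_numeric(lam, mu1, horizon=200.0, steps=200_000):
    """Trapezoidal estimate of the integral of rel(t) over [0, horizon]."""
    dt = horizon / steps
    total = 0.5 * (rel(0.0, lam, mu1) + rel(horizon, lam, mu1))
    for k in range(1, steps):
        total += rel(k * dt, lam, mu1)
    return total * dt


def mttf_closed_form(lam, mu1):
    """Closed-form result: mean time in O plus mean time in T."""
    return 1.0 / lam + 1.0 / mu1


# Arbitrary illustrative rates.
print(mttf_numeric(0.5, 2.0), mttf_closed_form(0.5, 2.0))
```

The closed form also has a direct reading: the system spends an exponentially distributed time with mean 1/lam in O and then one with mean 1/mu1 in T before it first reaches R.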


    5. (13 points) System level diagnosis

    (a) (6 points) One-step t-fault diagnosable system

    It is known that a system of n units in which no two units test each other is one-step t-fault diagnosable if and only if every unit is tested by at least t other units. I claim that the following system with 6 units, shown in Figure 4, is one-step 2-fault diagnosable even though some units test each other.

    [Figure 4: A one-step 2-fault diagnosable system (units 1 through 6).]

    Prove the above claim by providing convincing argument(s).

    Answer:

    If the link from node 2 to node 3, or the link from node 3 to node 2, or both links are removed from the above system, the resulting system of 6 units is such that no two nodes test each other. In this modified system every node is still tested by two other nodes, so it meets the necessary and sufficient condition for one-step 2-fault diagnosis. Restoring the removed link(s) only adds test outcomes to the syndrome, so any two fault sets that were distinguishable remain distinguishable. Hence we can conclude that the system with the additional links is also one-step 2-fault diagnosable.


    (b) (7 points) Single loop system

    Consider a single loop system consisting of 10 units v1, v2, ..., v10. In this system unit v1 tests v2, v2 tests v3, etc., and finally v10 tests v1. This system is sequentially 4-fault diagnosable. For the following syndrome, identify as many faulty units as you can and provide reasoning for your answer. The syndrome is written with the outcome of the test v1 -> v2 as the first element.

    Tester:   v1  v2  v3  v4  v5  v6  v7  v8  v9  v10
    Outcome:   1   1   1   1   1   1   1   0   0   1

    Answer: In this system we can diagnose all the faulty units in one step, as follows. The argument relies on the fact that if the test outcome between a pair of units is 1, then at least one unit of that pair must be faulty.

    Step 1: Unit v1 must be faulty. If it is not, then v2, v8, v9 and v10 must be faulty: v1 reports v2 as faulty; v10 reports the fault-free v1 as faulty, so v10 is faulty; and v9 and then v8 each pass a faulty unit, so they are faulty as well. In addition, one unit from each of the pairs (v3,v4) and (v5,v6) must also be faulty, giving at least 6 faulty units, which exceeds the limit of 4. Hence v1 is faulty.

    Step 2: From each of the pairs (v2,v3), (v4,v5), (v6,v7) at least one unit must be faulty. Together with v1 this makes a total of four faulty units in the system. Now we assert that from the pair (v6,v7) the unit v7 must be faulty. If not, then in addition to v6 being faulty, v8 must also be faulty, and that would make the total number of faulty units at least 5. Hence v7 is faulty.

    Step 3: Use an argument similar to Step 2, considering the pairs (v2,v3) and (v4,v5), to conclude that v5 must be faulty and v6 must not be faulty.

    Step 4: The same argument finally leads to unit v3 being faulty.

    Hence the above syndrome leads to the conclusion that units v1, v3, v5 and v7 are faulty.
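The step-by-step argument can be cross-checked by brute force: enumerate every set of at most 4 faulty units and keep those that are consistent with the syndrome. The sketch below assumes the usual test model in which a fault-free tester always reports the true status of the unit it tests and a faulty tester may report anything; the function names are illustrative.

```python
from itertools import combinations

MAX_FAULTS = 4
# SYNDROME[i] is the outcome of unit v(i+1) testing the next unit on the loop.
SYNDROME = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]


def consistent(faulty, syndrome):
    """True if the fault set (0-based unit indices) could produce the syndrome."""
    n = len(syndrome)
    for tester in range(n):
        if tester in faulty:
            continue  # a faulty tester may report anything
        tested = (tester + 1) % n
        expected = 1 if tested in faulty else 0
        if syndrome[tester] != expected:
            return False
    return True


def candidate_fault_sets(syndrome, max_faults):
    """All fault sets of size <= max_faults that are consistent with the syndrome."""
    n = len(syndrome)
    found = []
    for size in range(max_faults + 1):
        for combo in combinations(range(n), size):
            if consistent(set(combo), syndrome):
                found.append([f"v{i + 1}" for i in combo])
    return found


print(candidate_fault_sets(SYNDROME, MAX_FAULTS))
```

Under these assumptions exactly one fault set survives, which is why all faulty units can be identified from this syndrome in a single step.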


    6. (18 points) Error detection and correction coding

    (a) (6 points) A (9,6) linear block code with six information bits, i1, i2, i3, i4, i5, i6, and three parity bits, p1, p2, p3, uses the following three equations to realize the three parity bits:

    p1 = even parity over the even-numbered information bits
    p2 = even parity over the odd-numbered information bits
    p3 = even parity over all information bits

    Write its generator matrix in systematic form in the following table. For your convenience I have completed the first row and the first column of the table.

    G =
         i1 i2 i3 i4 i5 i6 p1 p2 p3
          1  0  0  0  0  0  0  1  1
          0
          0
          0
          0
          0

    Answer: The completed matrix is shown below, with every 0 entry written out.

    G =
         i1 i2 i3 i4 i5 i6 p1 p2 p3
          1  0  0  0  0  0  0  1  1
          0  1  0  0  0  0  1  0  1
          0  0  1  0  0  0  0  1  1
          0  0  0  1  0  0  1  0  1
          0  0  0  0  1  0  0  1  1
          0  0  0  0  0  1  1  0  1
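The rows of G follow mechanically from the three parity equations: row k is the codeword for the information word that has a single 1 in position k. A minimal sketch of that construction is given below; `generator_matrix` and `encode` are illustrative names, and the bit ordering (i1 ... i6, then p1 p2 p3) follows the table header.

```python
def generator_matrix(k=6):
    """Systematic generator matrix for the (9,6) code described above."""
    rows = []
    for i in range(1, k + 1):  # i is the 1-based index of the information bit
        identity = [1 if j == i else 0 for j in range(1, k + 1)]
        p1 = 1 if i % 2 == 0 else 0  # even-numbered bits contribute to p1
        p2 = 1 if i % 2 == 1 else 0  # odd-numbered bits contribute to p2
        p3 = 1                       # every information bit contributes to p3
        rows.append(identity + [p1, p2, p3])
    return rows


def encode(info_bits, G):
    """Multiply the information vector by G over GF(2)."""
    return [sum(b * g for b, g in zip(info_bits, column)) % 2 for column in zip(*G)]


G = generator_matrix()
for row in G:
    print(" ".join(str(bit) for bit in row))
```

For example, encoding the information word 110000 gives parity bits p1 = 1, p2 = 1 and p3 = 0, since i1 and i2 cancel in the overall parity.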


    (b) (12 points) The parity check matrix H of an (11,8) linear block code is given below:

    H =
        1 0 1 0 1 0 1 0 1 0 0
        0 1 0 1 0 1 0 1 0 1 0
        1 1 1 1 1 1 1 1 0 0 1

    Now prove or disprove the following. You must provide an example or a good explanation.

    i. (2 points) This is a single error correcting code.
    Answer: This is not true. There are many identical columns; e.g., columns 1 and 3 are the same, so an error in bit 1 or in bit 3 gives rise to the same syndrome.

    ii. (3 points) It can detect any two consecutive bit errors.
    Answer: This is true. The sum of any two consecutive columns is nonzero. To enumerate all cases, we can reduce them by observing that the first 8 columns are of only two different types, which alternate. The remaining cases are easy to enumerate.

    iii. (2 points) It can detect arbitrary two bit errors.
    Answer: This is false. Errors in bit positions that correspond to identical columns of the H matrix (such as bits 1 and 3) give a zero syndrome and hence will not be detected.

    iv. (3 + bonus points) It can detect any three consecutive bit errors.
    Answer: This is true. You can show it by computing the syndrome for every set of three consecutive errors. However, you can reduce the cases: within the first 8 columns of H there are only two cases, as follows:

        [1]   [0]   [1]   [0]          [0]   [1]   [0]   [1]
        [0] + [1] + [0] = [1]    OR    [1] + [0] + [1] = [0]
        [1]   [1]   [1]   [1]          [1]   [1]   [1]   [1]

    Then only three more cases remain (the windows that include column 9, 10 or 11), and they also lead to a nonzero syndrome.

    v. (2 points) It can detect any four consecutive bit errors.
    Answer: This is false. Consider, for example, errors in the first four bit locations. The corresponding syndrome is 0, hence the four-bit error will not be detected.
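All five claims reduce to adding columns of H modulo 2, so they can be checked exhaustively. The sketch below is one way to do that; `syndrome` and `undetected_bursts` are illustrative helper names, and bit positions are 0-based inside the code and reported 1-based to match the text.

```python
H = [
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1],
]


def syndrome(error_positions):
    """Mod-2 sum of the H columns at the given 0-based error positions."""
    positions = list(error_positions)
    return tuple(sum(row[p] for p in positions) % 2 for row in H)


def undetected_bursts(length):
    """1-based start positions of consecutive-error patterns with zero syndrome."""
    n = len(H[0])
    return [
        start + 1
        for start in range(n - length + 1)
        if syndrome(range(start, start + length)) == (0, 0, 0)
    ]


print("columns 1 and 3 equal:", syndrome([0]) == syndrome([2]))
print("undetected 2-bit bursts:", undetected_bursts(2))
print("undetected 3-bit bursts:", undetected_bursts(3))
print("undetected 4-bit bursts:", undetected_bursts(4))
```

An empty list for lengths 2 and 3 supports answers ii and iv, while a non-empty list for length 4 gives counterexamples for answer v.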


    7. (15 points) Cyclic code

    A 4-bit information word is encoded using a single error correcting (7,4) cyclic code. The corresponding decoder (LFSR) is shown in Figure 5. The 7-bit encoded word is transmitted twice, and the two 7-bit received words (before decoding) are the following. In both cases the least significant bit (LSB) is written on the right.

    i) 0 1 1 0 1 0 1

    ii) 0 1 1 0 0 0 1

    [Figure 5: Cyclic code decoder (an LFSR with an "Encoded input" and a "Decoded output").]

    Now answer the following:

    (a) (3 points) Write the generator polynomial used for encoding the information word.

    Answer: The generator polynomial is g(x) = 1 + x + x^3.

    (b) (3 points) Write the received word shown in i) above in polynomial form.

    Answer: The received word (encoded word) in polynomial form is R_i(x) = 1 + x^2 + x^4 + x^5.

    (c) (6 points) Assuming that no more than a single error can occur during transmission of 7 bits, identify which of the two received word(s) is (are) in error, if any. Use any method you like (such as decoding or polynomial algebra), but you must show the work you perform to identify the correct and/or the erroneous word(s).

    Answer: We divide the polynomial R_i(x) = 1 + x^2 + x^4 + x^5 by g(x). Modulo g(x) we have x^3 = x + 1, so x^4 = x^2 + x and x^5 = x^2 + x + 1, and the remainder is x^2. This is nonzero, hence we can conclude that this word is in error.

    Similarly, when we divide the polynomial for the second word, R_ii(x) = 1 + x^4 + x^5, by g(x), we find that the remainder is 0. Hence we conclude that this is an error-free received word.

    (d) (3 points) What was the 4-bit information word (write LSB to the right) that was encoded in the above case?

    Answer:

    To obtain the information word, we can decode the error-free received word using the decoder provided in Figure 5. The simulation of the decoding process is shown below. Note that the input is 1 + x^4 + x^5, i.e., 0110001 (LSB to the right).

    Input   State    Output
      1     0 0 0      1
      0     1 0 1      1
      0     1 1 1      1
      0     1 1 0      0
      1     0 1 1      0
      1     0 0 1      0
      0     0 0 0      0

    Thus the output will be 0000111. The rightmost four bits are the decoded word and the leftmost three bits are the remainder (syndrome) after division.

    Hence the information word must be 0111 (1 + x + x^2).
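The divisions in parts (c) and (d) can be reproduced with a few lines of GF(2) polynomial arithmetic. In the sketch below a polynomial is stored as an integer whose bit k is the coefficient of x^k, so the received words can be typed exactly as written (LSB on the right). The helper name `gf2_divmod` is illustrative, and reading the quotient as the information word assumes the non-systematic encoding c(x) = m(x) g(x), which is consistent with the answer 0111.

```python
G_POLY = 0b1011  # g(x) = x^3 + x + 1


def gf2_divmod(dividend, divisor):
    """Quotient and remainder of polynomial division over GF(2)."""
    quotient = 0
    remainder = dividend
    shift = remainder.bit_length() - divisor.bit_length()
    while shift >= 0:
        quotient |= 1 << shift         # record the quotient term x^shift
        remainder ^= divisor << shift  # subtract (XOR) the shifted divisor
        shift = remainder.bit_length() - divisor.bit_length()
    return quotient, remainder


for label, word in (("i", 0b0110101), ("ii", 0b0110001)):
    q, r = gf2_divmod(word, G_POLY)
    print(f"word {label}: quotient={q:04b} remainder={r:03b}")
```

A nonzero remainder flags the received word as erroneous; for the error-free word the quotient bits give the information word under the stated encoding assumption.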


    8. (8 points) Checkpointing and recovery

    Two processes, P and Q, running in parallel exchange messages and take uncoordinated checkpoints. Figure 6 shows all checkpoints taken by the processes P and Q; however, it does not show the message exchanges. You are required to create a message exchange scenario such that if the process P fails at the point marked in the figure, it will cause a domino effect. The scenario you create must satisfy the following constraint on the message exchanges: no process sends more than one message in any one checkpoint interval, and no process receives more than one message in any one checkpoint interval.

    [Figure 6: Figure for demonstrating domino effect. Timelines for P (checkpoints PC1, PC2, PC3) and Q (checkpoints QC1, QC2, QC3), with the failure point marked.]

    Answer:

    A possible message exchange scenario is given in Figure 7.

    [Figure 7: Figure for demonstrating domino effect. Timelines for P (checkpoints PC1, PC2, PC3) and Q (checkpoints QC1, QC2, QC3), with the failure point marked.]
