IEEE TRANSACTIONS ON EDUCATION, VOL. E-30, NO. 1, FEBRUARY 1987

A Course on the Design of Reliable Digital Systems

BARRY W. JOHNSON, MEMBER, IEEE

Abstract: A graduate course on the design of reliable digital systems is described. One of the unique features of this course is the teaching of design methodologies. The concepts learned in class are solidified through a required course project where the students apply the techniques and the design methodologies studied. The project consists of developing a preliminary design of a highly reliable and fault-tolerant flight-control system for an aircraft. The students must perform numerous design tradeoffs and make design decisions to produce a system that meets the stringent requirements of the application. A second unique feature of this course is the use of numerous practical examples of the technology during the discussion of the design and analysis techniques.

Manuscript received June 19, 1985.
The author is with the Center for Semicustom Integrated Systems, Department of Electrical Engineering, University of Virginia, Charlottesville, VA 22901.
IEEE Log Number 8611704.
0018-9359/87/0200-0027$01.00 © 1987 IEEE

INTRODUCTION

The incorporation of digital systems into the vital aspects of almost everyone's life has [illegible] interest in the development of highly reliable and fault-tolerant designs. It is no longer sufficient for engineers and scientists to create systems that merely perform their functions; designers must now develop systems that also have extremely high reliabilities, self-diagnostics to aid in repair, and, in many cases, [the ability to operate] correctly after one or more failures have occurred. Graduating students in digital design must have some knowledge of 1) failure sources in digital circuits, 2) techniques that are available to combat failures, 3) methods for designing systems to meet reliability or other specifications, and 4) analysis techniques to verify that a system has met its requirements.

This paper describes a course on the design and analysis of reliable digital systems. The course was developed at the University of Virginia's Department of Electrical Engineering to properly prepare graduate students and to support research in the Department's Center for Semicustom Integrated Systems. This paper is not intended to serve as a tutorial on reliable system design but presents the details of a course on this topic. There are several tutorial articles available for the interested reader [1], [2].

COURSE GOALS

The course had four primary goals. First, it was desired to introduce the students to the terminology used in the fault tolerance community to allow them to read and understand the existing literature. Second, the students were to be exposed to modern techniques that are presently available for designing and analyzing reliable and fault-tolerant digital systems. Much of the design of fault-tolerant systems involves possessing knowledge of techniques that can be applied under certain conditions and then determining how those techniques may be used in a specific application. Third, the students should be provided with practical knowledge and experience in the design of reliable digital systems. The two key words are [illegible]; the students should understand the phases of a typical design process and have an opportunity to apply that understanding. It is also important for students to examine large systems; unfortunately, many graduate students complete their education without having considered the design of a digital system larger than a simple, single-processor one. The final goal is to expose the student to research that is being performed in fault tolerance and reliable design so as to support the research activities within the department.

In an attempt to meet the goals, the course included 1) numerous examples of industrial applications of the technology, 2) student presentations of important papers in the area, and 3) a significant design project that attempted to solidify many of the important concepts. These three aspects of the course are considered to be extremely important parts of the educational process. In addition, the students' comments reflected their preference for these particular features of the course.

Perhaps the most important feature of the course is the emphasis on design methodologies for developing reliable and fault-tolerant systems. The students were required to use these methodologies in a design project that required the students to perform tradeoffs just as they would in an industrial design environment. The tradeoffs involved not only reliability and performance characteristics but cost, weight, and power restrictions as well. The project provided a good opportunity for the students to apply the theory learned during the class lectures and discussions.

COURSE DESCRIPTION

The course is offered as an intermediate-level graduate course available to all masters and doctoral students. The prerequisite is an understanding of basic digital logic design and computer organization. The course was first taught in the Spring Semester of 1985, and 13 students were enrolled. The course was subsequently offered in the Spring Semester of 1986 with an enrollment of 16 students. The textbook used was The Theory and Practice of Reliable System Design by D. P. Siewiorek and R. Swarz. The textbook was selected primarily because it covered all of the pertinent topics and had reprints of several important papers in the area. In addition, it is one of the few textbooks available on this topic. Although the textbook was used extensively, the majority of the course material was taken from the instructor's own notes, particularly the examples that were presented in class.

Course Topics

Table I lists the topics that were covered during the one-semester course and the approximate number of class periods devoted to each. The course begins with an introduction to the fundamental terminology. The key concepts of fault, failure, and error have vastly different meanings in the commonly accepted fault tolerance literature. Likewise, it is important for the students to have a clear understanding of the technical definitions of reliability, availability, maintainability, safety, fault tolerance, and fault avoidance.

TABLE I
COURSE OUTLINE

Subject | Example Topics | Classes
Introduction | Course objectives and outline | 1
Basic Definitions | Reliability, availability, failure, fault, error, fault tolerance, fault masking, fault avoidance, fault coverage, maintainability | 1
Application Areas | Long-life, computation critical, maintenance postponement, and high availability | 1
Examples of each application | Fault-Tolerant Spaceborne Computer, flight-control system, Bell Electronic Switching System, and Tandem Nonstop computer | 2
Causes of Faults | Specification mistakes, implementation mistakes, external disturbances, and component failures | 1
Design Techniques | Fault tolerance versus fault avoidance | 1
Concept of Redundancy | Hardware, information, time, and software redundancy | [illegible]
Hardware Redundancy | Passive, dynamic, and hybrid | 3
Detailed Example | Aircraft flight control system | 1
Student Presentation | Common-mode hardware failures | 1
Student Presentation | Hardware redundancy in VLSI | 1
Information Redundancy | Error detecting and error correcting codes | 3
Detailed Example | Error correcting memory | 1
Time Redundancy | Retry procedures and recomputation with modified operands | 3
Detailed Example | Arithmetic logic unit | 1
Student Presentation | Recomputation with shifted operands | 1
[Software Redundancy] | Periodic self-test, validity checks, software voting | 2
Detailed Example | Software-implemented self-test in an avionic system | 1
Reliability Analysis | Reliability block diagrams, Markov models, fault coverage estimation | 4
Maintainability Analysis | [illegible] (FMEA), [illegible] (MTTR) | [illegible]
Availability Analysis | Relationship between [illegible] | 1
Detailed Example | Reliability analysis of an avionic system | 2
Student Presentation | Yield modeling and the impact of redundancy on yield | 1
Design Project | Top-level design and analysis of a flight-control system | 3
Project Presentations | One class period per group | 4

To give the students an appreciation for the importance of designing reliable systems, several application areas were discussed, and examples from each area were described. The primary application areas considered included long-life applications, computation critical applications, maintenance postponement applications, and high availability applications. Examples from each include satellites (long-life), aircraft flight control systems (computation critical), remote processing stations (maintenance postponement), and banking systems (high availability). The emphasis at this early point in the course was on the requirements that the various systems had to meet. For example, many flight-control systems must operate after any two failures and have a reliability of greater than 0.9999999 throughout a 3-h, or longer, flight. Banking systems, on the other hand, may be designed to decrease their downtime, through the use of self-diagnostics, by decreasing the average time required to repair the system. In other words, the banking system may be allowed to fail fairly often as long as the repair is quick and easy. The flight-control system and the banking system may use some of the same basic techniques, but, in most instances, the design goals and the techniques employed may be vastly different.

One crucial topic presented in the class was the concept of designing for reliability. Many students have the mistaken impression that reliability can be added to a system after the functional design is completed, or that fault tolerance guarantees reliability. The student must clearly understand that fault tolerance is one technique used to achieve reliability, availability, maintainability, or other design goals. For fault tolerance to be successful, one must design the system with the system requirements and the fault tolerance approaches to meeting those requirements considered at every point during the design process.

The design approach presented in class is illustrated in Fig. 1 which shows the techniques used to meet the system's requirements. Fig. 1 should be viewed hierarchically with system design and system analysis being performed in parallel to support meeting the system requirements. A collection of system analysis techniques such as Markov models is available to support the design process. In addition, both fault tolerance and fault avoidance are crucial aspects of system design. Examples of
fault tolerance and fault avoidance approaches are listed in Fig. 1. Even though they are often overlooked, fault avoidance and system analysis are just as important as fault tolerance and must be an integral part of the design process.

Fig. 1. Reliable digital system design methodology. (System requirements drive system design and system analysis. System design divides into fault avoidance: parts selection, design reviews, quality control, design methodology, design rules, shielding, heat sinks, and documentation; and fault tolerance: hardware, software, information, and time redundancy, fault detection, fault containment, and reconfiguration. System analysis includes testing, environmental study, failure modes, combinatorial models, and failure data.)

Another key concept that must be conveyed to the students is that the design process naturally involves performing numerous tradeoffs and making a number of design decisions. During the design of complex systems, there is seldom a perfect solution. Many times the designer must sacrifice weight, power, or size to achieve a certain reliability. On the other hand, we may have to sacrifice reliability to meet weight or cost goals which, in some applications, may be more critical than slight improvements in the system's reliability.

The student is guided to proceed iteratively to progress from preliminary, candidate approaches to refined and workable solutions. The first two phases of the design process are shown in Fig. 2. Phase 1 involves defining the problem, partitioning the problem into manageable pieces, and creating and negotiating requirements. Phase 2 of the design process involves formulating basic approaches, analyzing the candidate approaches, and making tradeoffs to meet the requirements. In many cases, the designers of a system must return to Phase 1 and renegotiate requirements as a result of the evaluations performed in Phase 2.

Subsequent phases of the design process address specific hardware and software design, implementation, and testing issues. During the progression from a preliminary approach to a refined solution, the designer must often compromise, negotiate specific requirements, or sacrifice some of the system's attributes to develop the most nearly optimal solution to a particular problem. The design procedure involves negotiating with the individuals that set the requirements and often agreeing to, or requesting, concessions. In the classroom environment, the instructor acts as the student's customer and discusses the requirements of a system.

The regular class lectures were supplemented by four student presentations. The students were organized into groups, and each group was asked to study a paper from the current literature and give a 50 min (one class period) presentation on that paper. The papers covered common-mode failures in duplicated modules [3], fault tolerance implemented in VLSI [4], the use of time redundancy in an arithmetic logic unit [5], and the use of redundancy to improve the yield of VLSI devices [6]. All of the students were required to read each paper so that everyone was prepared to participate in the discussion. One of the important aspects of the paper presentations was the experience that the students gained in giving an oral presentation and responding to questions from an audience.

The class project was conducted over a four-week period and required both a written report and an oral presentation. The project was organized as a request for proposal (RFP). The students were once again placed in groups, and each group worked in competition with the others. The scenario was one of multiple companies responding to the same RFP from the government or some other company. The groups treated the instructor as the agency that developed the RFP and held regularly scheduled meetings with the instructor to discuss ideas and strategies, just as one would in an industrial setting. The students responded to this type of project in a very positive manner. The competition between groups was friendly but at the same time fierce. Each group devoted significant amounts of time searching for a novel approach that would give them an edge over their competitors.

Instructional Examples

As previously mentioned, several examples were presented during the lectures to give the students some exposure to the practical application of the technology. Two of those examples are presented and discussed in this paper.

Triple Modular Redundancy Example: One of the most common techniques used to achieve fault tolerance and to improve a system's reliability is to triplicate certain modules and vote on the results produced by each module. This approach is called Triple Modular Redundancy (TMR) and is illustrated in Fig. 3. The most common question raised about TMR concerns the effect that the voter has on the system's fault tolerance and reliability. As illustrated in Fig. 3, the failure of the voter will result in the failure of the complete system. To overcome the problems of the voter, the voter itself is often triplicated such that the system will produce three correct results even if one of the modules fails. This approach is commonly called triplicated TMR and is illustrated in Fig. 4.

Even though the approach of Fig. 4 may solve some of
the difficulties, the designer still has a problem if a single motor, or other device, has to ultimately be controlled; somewhere within the system, a single result has to be created in a manner that minimizes the sacrifice in reliability. Techniques are available to solve such problems and have been successfully applied in critical applications such as the space shuttle [7]. Unfortunately, discussions on these types of problems seldom make it into the classroom.

Fig. 2. Top-level system design process. (Phase 1: problem definition, problem partition, requirements, and negotiation. Phase 2: top-level design, analysis, tradeoffs, and evaluation.)

Fig. 3. Triple modular redundancy. (Inputs 1 to 3 feed Modules 1 to 3, whose results pass through a single voter to one output.)

Fig. 4. Triple modular redundancy with triplicated voters. (Modules 1 to 3 feed three voters, which produce Outputs 1 to 3.)

Fig. 5. The basic flux-summing technique. (Three processing modules, each with its own input and feedback signal.)

The technique presented in this course is called flux-summing. Fig. 5 illustrates the basic idea of the flux-summing approach when applied to the control of the armature current of a small motor. The principle of flux-summing is to use the inherent properties of the closed-loop feedback system to compensate for any failures that might occur. The flux-summer is a transformer that has three primary windings and a single secondary winding. The current produced in the secondary winding is proportional to the sum of the currents in the three primary windings. Each module receives command and feedback signals that are derived independently from the signals for the remaining modules. Under failure-free circumstances, each module provides approximately one-third of the total current to the motor.

If module A fails and starts producing a maximum output current of i_max amperes, the remaining two fault-free modules automatically compensate for the faulty condition because their error signals begin to indicate that the output of the motor is not at the appropriate value. The combination of the current summation process and the feedback system allows the fault-free modules to offset the current of the one faulty module. It should be noted that the flux-summing approach is not a voting arrangement but has the same effect of masking a single failure. The transformer system can be designed very reliably and is extremely resistant to external interference.

Other approaches similar to flux-summing have been developed and were briefly discussed in class. For example, the space shuttle uses a force-summing technique [7] while position summing and active standby arrangements have also been proposed [11].

Reliability Analysis Example: Another important part of designing a reliable system is the analysis of the system and the verification that one has met the reliability requirements. Several techniques are available to accomplish reliability analysis. Perhaps the most common reliability analysis technique is the use of reliability block diagrams. While effective, the reliability block diagrams are often difficult to construct, particularly if the system is complex. A more powerful tool is the Markov reliability modeling technique. Markov models fit well the common assumption of constant failure rates within a system.

While it is easy and educational to create Markov models for simple systems such as a TMR design, it is much more meaningful to show students an actual example of a reasonably complex design. In the course described in this paper, a Markov model of a dual-redundant system with four distinct types of electronic subsystems was presented. The complete model had a total of 81 states but could be reduced to 17 states if equivalent states were eliminated. The dual-redundant architecture that was analyzed was used in an aerospace application [8], [9], but the same general architecture has been applied to telephone switching systems, banking systems, and satellite applications [10].

The general architecture of the system to be analyzed is shown in Fig. 6. The system contains two main processing units, two buses for communication, two local interface units, and two remote interface units. The interface units contain circuits such as analog-to-digital and digital-to-analog converters. A unique aspect of the system that must be accounted for in the reliability analysis is that the local interface units are housed in the same boxes as the processing units and are powered from the same supply. Therefore, the failure of one power supply can result in the failure of two functional components of the system: a processing unit and a local interface unit.

Fig. 6. Architecture of a dual-redundant system. (Processing Units 1 and 2, Local Units 1 and 2, and Remote Units 1 and 2, connected by Bus A and Bus B.)

The state transitions that can occur from the initial state of the Markov model when failures occur are shown in Fig. 7. The system starts in state (2222) which corresponds to the condition of all units functioning correctly. As units fail, transitions occur to the appropriate states with the transition probabilities being characterized by the failure rates and the fault coverage factors. The interesting transitions are those that go from state (2222) to state (1122) and those that go from state (2222) to the failed state. As previously mentioned, the transition from (2222) to (1122) occurs as a result of a power supply failure that disables both a processing unit and an interface unit.

Fig. 7. Failure transitions from initial state.

Transitions from state (2222) to the failed state can occur as a result of faults that are not detected and appropriately handled. Because the system is dual-redundant, failures must be detected and a reconfiguration implemented before the system can continue operating correctly. Therefore, the fault coverage factor plays an important role in the overall reliability of the system.

The equations for the Markov model may be written in matrix form as

$$P(t + \Delta t) = A\,P(t)$$

where $P$ is the probability state vector, $\Delta t$ is the time step, and $A$ is the state transition matrix. By solving the equations of the Markov model, one can determine the probability of the system being in the failed state.

Fig. 8. System reliability versus time. (Horizontal axis: mission time in hours, 0 to 3.)

Fig. 8 shows the reliability of the system as a function of time, and Fig. 9 shows the probability of system failure at the end of a 3-h time period as a function of fault coverage factor. Figs. 8 and 9 were created using failure rate data from an actual implementation of the system [8], [9].
Figs. 8 and 9 were used to illustrate to the students the types of results that are required in an actual reliability analysis.

Fig. 9. Probability of failure versus coverage. (Vertical axis: logarithm of failure probability, -5.00 to -2.00; horizontal axis: coverage factor in percent, 92.0 to 100.0.)

Fig. 10. Failure diagram for triple-duplex approach.
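
As a rough illustration of the matrix form of the Markov model used in the reliability analysis example, the following sketch steps a probability state vector forward in time and reports the probability of the failed state for several coverage factors. It is not the 17-state model presented in the course. It assumes a hypothetical three-state duplex system (both units working, one unit working, failed), a single failure rate of 1e-4 per hour chosen only for illustration, and function names invented for this example.

```python
import math


def duplex_transition_matrix(failure_rate, coverage, dt):
    """Build A for a hypothetical three-state duplex model.

    States: 0 = both units working, 1 = one unit working, 2 = system failed.
    Columns are "from" states, so each column sums to 1.
    """
    # A unit fails and the fault is detected: reconfigure to the survivor.
    covered = 2.0 * failure_rate * coverage * dt
    # A unit fails and the fault is not detected: the system fails.
    uncovered = 2.0 * failure_rate * (1.0 - coverage) * dt
    # The last remaining unit fails.
    last_unit = failure_rate * dt
    return [
        [1.0 - covered - uncovered, 0.0, 0.0],
        [covered, 1.0 - last_unit, 0.0],
        [uncovered, last_unit, 1.0],
    ]


def step(a, p):
    """Apply one time step: P(t + dt) = A P(t)."""
    return [sum(a[i][j] * p[j] for j in range(len(p))) for i in range(len(a))]


def failure_probability(failure_rate, coverage, hours, dt=0.001):
    """Probability of being in the failed state after `hours` of operation."""
    a = duplex_transition_matrix(failure_rate, coverage, dt)
    p = [1.0, 0.0, 0.0]  # start with all units working
    for _ in range(round(hours / dt)):
        p = step(a, p)
    return p[2]


if __name__ == "__main__":
    # Assumed failure rate of 1e-4 per hour, chosen only for illustration.
    for percent in (92, 94, 96, 98, 100):
        q = failure_probability(1e-4, percent / 100.0, hours=3.0)
        print(f"coverage {percent:3d}%  log10(P_fail) = {math.log10(q):.2f}")
```

Under these assumptions the failure probability after a 3-h period falls steeply as the coverage factor approaches 100 percent, which is the kind of sensitivity a coverage plot is meant to expose. The numbers depend entirely on the assumed failure rate and are not the data behind Figs. 8 and 9.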

Class Project

The class project was designed to require that the students use the design methodologies, techniques, and analysis methods learned in class. The most important point of the project was to get the students to perform design tradeoffs and make decisions as a result of those tradeoffs. The students were concerned about not only reliability and fault tolerance characteristics but important factors such as weight, cost, power consumption, testability, and the ease of design. The project was constructed such that there were several acceptable solutions, and the manner in which the students performed the tradeoffs determined which solution was selected. The Appendix presents the project description as it was given to the students.

To provide the reader with some information on the type of work performed by the students, several example analyses are presented. The most common alternative to the two approaches suggested in the RFP was a three-of-five system using a flux-summing arrangement. One group that recommended the three-of-five approach developed failure diagrams for each of the three candidate systems that they analyzed. For example, the failure diagram for the triple-duplex approach, as shown in Fig. 10, clearly illustrates what can happen in the triple-duplex system when failures occur.

Reliability models for each of the three approaches were developed by the students, and a comparison of the reliability as a function of failure rates and fault coverage factors was performed. Fig. 11, for example, shows how the failure probabilities vary as a function of the fault coverage factor for the triple-duplex, quad, and three-of-five approaches. The key selling point for the three-of-five system is that its failure probability is essentially independent of the fault coverage factor, thus making the system easier to design, build, and test. The three-of-five approach does not require fault detection to tolerate the first and second failures.

Fig. 11. Failure probability comparisons of three approaches. (Curves for the triple-duplex, quad, and 3-of-5 systems; vertical axis: logarithm of failure probability; horizontal axis: coverage factor in percent, 92.0 to 100.0.)

One group of students investigated, in more detail, the impact that the fault coverage can have on the overall reliability of a given approach. To accomplish the analysis, the group generated a Markov model that incorporated sufficient states to account for the system remaining operational because of fault masking but actually containing an undetected fault. For example, Fig. 12 shows the transitions that can result when failures occur in the triple-duplex approach. The system is in the undetected-fault state when the flux-summing has tolerated a fault, but the fault has not been detected. This type of analysis is very useful because the faults may remain in the system at the start of the next flight, thus reducing the system's reliability during subsequent flights.

Additional groups investigated other important topics such as the arrangement and number of sensors required in a given system architecture. For example, the triple-duplex approach can use one sensor for each processor, each processor pair might share a sensor, or each of the six processors might receive the signals from each of three sensors and vote to determine which sensor value to use. Each approach has advantages and disadvantages that impact the reliability of the system. The group considered several important tradeoffs during their discussion of the sensor implementation.

Finally, each group performed tradeoffs concerning the power, weight, size, and cost of the candidate approaches. The written reports and the oral presentations included trade matrices that compared the salient features
cent), a paper presentation (10 percent), the project (30(3 ) >( 2 ) >( 1 percent), and a final exam (20 percent). Both the final

exam and the midterm test consisted of a closed-book por-tion that tested the student's general understanding of ter-minology and fundamental concepts, and an open-bookportion that consisted of small design. problems.

SUGGESTED COURSE MODIFICATIONSAt the end of any course, instructors and students are

usually full of ideas for improving and upgrading the{Undetected >( System A course. The student response to this course was positiveFailure Fai and encouraging. This course was, in many cases, the first

Fig. 12. Failuretranstionsfortipleexposure that the students had to top-level system designFig. 12. Failure transitions for triple-duplex approach.and analysis. Rather than being concerned with individualcircuits, the students were examining a system architec-

(reliability, weight, power, size, testability, and difficulty ture, in many cases, and were concerned with developingof design) for each candidate approach. It is interesting to that architecture to meet many overall (often conflicting)note that two of the four groups proposed three-of-five requirements. This approach to teaching is also being usedsystems while the remaining two recommended modified in other courses within the University of Virginia's De-versions of the triple-duplex concept. partment of Electrical Engineering. For example, a course

Course Grading on computer design is currently offered that emphasizesthe design methodologies and the importance of tradeoffs

As in any design course, grading is an extremely diffi- in the design process. It is important that students be ex-cult and time-consuming problem; there is no single, cor- posed to these types of concepts early in their education.rect solution to homework problems, tests, projects, and The most fundamental need that the course has is forfinal exams. Therefore, most grading must be based upon new homework assignments. Unfortunately, the materialthe instructor's perception of the appropriateness and does not often lend itself to the construction of assign-completeness of the students' assumptions and design ap- ments that can be performed in a reasonable length ofproaches. The class project, for example, was graded just time. The author is currently developing new assignmentsas a proposal would be reviewed with five primary issues that will be suitable for the next offering of the course.being considered. First, the students were expected todemonstrate an understanding of the technical issues in- CONCLUSIONSvolved in the system design and the candidate approaches. Design courses typically consume large amounts of bothThe students provided evidence of their understanding the instructor's and the students' time. Based on discus-through tutorial material provided in the introductions of sions with the students, most seem to have devoted antheir reports and in the technical descriptions of the pro- average of about ten hours per week to the class, in ad-posed systems. This represented 20 percent of the project dition to the three hours of class time each week. Thegrade. project required approximately 50 hours for the students

Second, the project was evaluated on the completeness to complete, but the project was performed over the lastof the design tradeoffs. The students should have pre- four weeks of the course. The paper presentations typi-sented quantitative comparisons of reliability, size, cally required about 20 hours for the students to read, un-weight, cost, power, throughput, and maintainability derstand, and analyze the paper and prepare a profes-while presenting qualitative comparisons of ease of im- sional presentation. Finally, the test and final exam wereplementation, ease of testing, and overall flexibility of the performed outside of class and required about eight hoursdesign. Because the tradeoffs are crucial to the design pro- each.cess, they represented 40 percent of the project grade. In addition to the three hours per week in the class-

Third, the students were evaluated on the uniqueness of room, the instructor spent an average of approximatelytheir approach. Typically, a proposal without any novel six hours per week preparing lecture material, three hourssolutions has a lower probability of being funded than one per week grading homework assignments or tests, and fivethat presents a good unique approach. Uniqueness ac- hours per week discussing the material with the studentscounted for 10 percent of the project grade. outside of class. The time devoted to the course varied as

Finally, the students were evaluated on the quality of a function of the particular assignments that were giventheir written and oral presentations. The ability to suc- to the class. For example, approximately four hours werecessfully communicate ideas and results is extremely im- required to grade each of the four projects that were com-portant in the engineering field and represented 30 percent pleted by the student groups. During the project, signifi-of the project grade. cantly more time was devoted to discussions with the stu-The final course grade was derived from weekly home- dents concerning the project.

work assignments (20 percent), a midterm test (20 per- The amount of time required of the instructor will vary

34 IEEE TRANSACTIONS ON EDUCATION, VOL. E-30, NO. 1, FEBRUARY 1987

as a function of the instructor's background. It is certainly beneficial if the instructor has experience in the design of reliable and fault-tolerant digital systems; however, that background is not completely necessary. To minimize the amount of external reading that the instructor must perform to prepare class lectures, homework problems, and exams, the instructor's background should include 1) basic digital design, 2) microprocessors, 3) some probability theory (Markov modeling), and 4) some knowledge of VLSI.

The design of reliable and fault-tolerant digital systems is an important part of our research program at the University of Virginia, and the course described in this paper is the cornerstone of the educational process that both complements and supports that research. Based on the students' comments and their performance on subsequent research projects and graduate comprehensive examinations, the course goals, as previously stated, have been met.

APPENDIX

PROJECT DESCRIPTION

REQUEST FOR A PROPOSAL ON AN AIRCRAFT FLIGHT CONTROL SYSTEM

RELIABLE AIRCRAFT CORPORATION
CHARLOTTESVILLE, VA
MARCH 26, 1985

All written proposals and copies of transparencies for an oral presentation are due April 19, 1985. Oral presentations are to be made on either April 24, 26, 29, or May 1, 1985.

INTRODUCTION

The Reliable Aircraft Corporation is presently designing the next generation military tactical fighter and requests your help in developing a candidate flight-control architecture. This document describes the areas in which we request your assistance. Because we are still in the early phases of this development, we request that you present any design philosophies that support the architecture that you propose. Although we are not expecting you to define specific hardware, your long-range goals should be to reduce cost, weight, size, and power consumption while maximizing reliability and maintainability.

We hope the solutions that you present will incorporate leading-edge technology because we are defining a system that will not go into production until about 1992 and should have a useful life of about 20 years. The bottom line is that we do not want the design to be obsolete before we go to production. We encourage frequent contact with this office to guarantee that you are devoting your resources to the most pertinent problems in the most efficient manner. Please feel free to contact us at any time prior to your presentation.

This document represents the first of several that you will receive during the course of this project. Preliminary information is provided here to allow you to begin to formulate questions and ideas.

PROBLEM DESCRIPTION

The architecture that must be developed is required to implement the flight-control laws for the tactical fighter. The primary requirements placed on the architecture are the achievement of a reliability of 0.9999999 for a 3-h mission and the capability to tolerate any two failures. A block diagram of the nonredundant flight control system is shown in Fig. 13. We are asking you to develop a fully redundant, fault-tolerant, and reliable version of this architecture. Your proposal should include design philosophies, block diagrams, a reliability analysis, complete descriptions of the fault detection capabilities, and proposed implementations (again at the block diagram level). A quantitative and qualitative evaluation of your approach should be presented. It should be clear from reading your proposal exactly why you feel your approach is superior to any others that might be presented.

Fig. 13 (block labels recovered from the diagram): Crew Commands; Roll Actuator; Yaw Actuator; Pitch Actuator.

You are not expected to completely design your system but instead present a top-level description and evaluation of the approach. Do not become involved in the specifics of the circuits but remain at a block diagram level.

The functions that the flight-control processors must implement are presented in Table II where the processing requirements of each task are also detailed. Your design should provide approximately 50 percent spare processing resources to allow for future growth.

TABLE II
FLIGHT CONTROL FUNCTIONS KIPS (THOUSANDS OF INSTRUCTIONS PER SECOND)

Functions          Processing Requirements
Pitch Control      26 KIPS
Roll Control       26 KIPS
Yaw Control        26 KIPS
Flap Control       20 KIPS
Thrust Control     40 KIPS
Power Control      40 KIPS
Mode Control       35 KIPS
Executive          10 KIPS

The tactical fighter is being developed for the Air Force and they have requested that we consider two candidate approaches that they have been investigating. You should critique these two approaches as part of your development process. The architecture that you develop into a solution may be based on one of the two suggested by the Air Force or one of your own. Your proposal should provide the most detail on the approach that you ultimately select, but you need to justify your selection.
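The numerical requirements stated in the problem description can be tallied with a short calculation. The following Python sketch is illustrative only and is not part of the original RFP: it sums the Table II loads, applies the 50 percent spare margin (read here, as an assumption, as 50 percent on top of the total load), converts the 0.9999999 reliability over a 3-h mission into an equivalent constant failure rate, and evaluates a simple m-of-n arrangement such as three-of-five. The function names, the exponential failure law, the assumption of perfect fault masking, and the module failure rate of 1e-4 per hour are all hypothetical choices made for the example.

```python
import math

# Processing requirements from Table II, in KIPS.
TABLE_II_KIPS = {
    "Pitch Control": 26,
    "Roll Control": 26,
    "Yaw Control": 26,
    "Flap Control": 20,
    "Thrust Control": 40,
    "Power Control": 40,
    "Mode Control": 35,
    "Executive": 10,
}

SPARE_FRACTION = 0.50             # "approximately 50 percent spare" in the RFP
REQUIRED_RELIABILITY = 0.9999999  # stated requirement
MISSION_HOURS = 3.0               # stated mission length


def required_throughput_kips(loads, spare_fraction):
    """Return (base load, load plus spare).

    Assumption: the spare is a fraction of the total load, not of capacity.
    """
    base = sum(loads.values())
    return base, base * (1.0 + spare_fraction)


def equivalent_failure_rate(reliability, hours):
    """Constant rate lam (per hour) such that exp(-lam * hours) == reliability."""
    return -math.log(reliability) / hours


def m_of_n_unreliability(m, n, module_rate, hours):
    """Probability that fewer than m of n modules survive the mission.

    Assumptions: independent modules, exponential failures, and perfect
    fault masking (fault coverage is not modeled).
    """
    q = -math.expm1(-module_rate * hours)  # per-module failure probability
    # The system fails when n - m + 1 or more modules have failed.
    return sum(
        math.comb(n, f) * q**f * (1.0 - q) ** (n - f)
        for f in range(n - m + 1, n + 1)
    )


if __name__ == "__main__":
    base, with_spare = required_throughput_kips(TABLE_II_KIPS, SPARE_FRACTION)
    print(f"Base load: {base} KIPS, with spare: {with_spare} KIPS")

    rate = equivalent_failure_rate(REQUIRED_RELIABILITY, MISSION_HOURS)
    print(f"Equivalent system failure rate: {rate:.3e} per hour")

    # Hypothetical module failure rate, chosen only for illustration.
    u = m_of_n_unreliability(3, 5, 1e-4, MISSION_HOURS)
    print(f"Three-of-five unreliability: {u:.3e}")
    print("Meets requirement:", u <= 1.0 - REQUIRED_RELIABILITY)
```

Under these assumptions the Table II loads total 223 KIPS, or 334.5 KIPS with the spare margin. The m-of-n function deliberately ignores fault coverage, so it gives no basis for comparing the triple-duplex and quad approaches, whose reliability depends on detecting faults.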

The first candidate approach is called the triple-duplex method and is illustrated in Fig. 14. Its fundamental characteristic is the use of triple modular redundancy (TMR) where each individual module is constructed using the duplication with comparison technique. When the comparison process detects a failure, the affected module is removed from the voting process. The voter is implemented using a flux-summing approach. The triple-duplex approach is similar to one developed by Honeywell [12].

Fig. 14. Triple-duplex approach to the flight control architecture.

The second candidate approach is called the quad redundancy method and is illustrated in Fig. 15. Its basic feature is the use of four modules that exchange data and independently vote (in software) to form four voted results. The four results are then flux-summed at the actuator drives. The quad redundancy method depends on the exchange of information among processors to determine which processors have failed. This approach is similar to the technique used in the space shuttle flight control system [7].

Fig. 15. Quad approach to the flight control architecture.

Your analysis of the two above approaches should include the following:

* implementation approaches and problems
* reliability comparisons
* weight, size, and power estimates
* testability comparisons
* flexibility of the architecture
* amount of redundancy required
* other attributes you feel are important.

The tradeoff analysis that you present should form a comparative analysis of the two suggested approaches and any unique approaches that you might develop. Emerging technologies, such as VLSI, and techniques, such as multiprocessor architectures, that might significantly impact the final implementation are of interest. Your analysis should discuss the availability of these new technologies as well as their impact (primarily cost, speed, and reliability) on the resulting design.

The outline shown in Table III is suggested for your proposal. Remember, this is a suggested outline and need not be followed explicitly. If you feel there are other things that need to be included then you should include them. Copies of the articles referenced herein are available from this office upon request.

TABLE III
SUGGESTED OUTLINE OF PROPOSAL

Section   Topic
1.0       INTRODUCTION
2.0       DESCRIPTION OF SUGGESTED APPROACHES
3.0       DESCRIPTION OF UNIQUE APPROACH (if any)
4.0       COMPARISON OF APPROACHES
5.0       RECOMMENDED APPROACH
6.0       RECOMMENDED IMPLEMENTATION
7.0       EVALUATION OF THE RECOMMENDED APPROACH
8.0       CONCLUSIONS

ACKNOWLEDGMENT

The author would like to thank the students who participated in the first offering of this course. Their questions, enthusiasm, and hard work made the teaching of this course a pleasure. In addition, the author would like to thank Dr. J. H. Aylor for his constructive suggestions concerning the content of this paper.

REFERENCES

[1] A. Avizienis, "Fault tolerance: The survival attribute of digital systems," Proc. IEEE, vol. 66, pp. 1109-1125, Oct. 1978.
[2] B. W. Johnson, "Fault-tolerant microprocessor-based systems," IEEE Micro, vol. 4, pp. 6-21, Dec. 1984.
[3] Y. Tamir and C. H. Sequin, "Reducing common mode failures in duplicate modules," in Proc. Int. Conf. Comput. Design, 1984, Port Chester, NY, Oct. 8-11, pp. 302-307.
[4] R. M. Sedmak and H. L. Liebergot, "Fault tolerance of a general purpose computer implemented by very large scale integration," IEEE Trans. Comput., vol. C-29, pp. 492-500, June 1980.
[5] J. H. Patel and L. Y. Fung, "Concurrent error detection in ALUs by recomputing with shifted operands," IEEE Trans. Comput., vol. C-31, pp. 589-595, July 1982.
[6] T. E. Mangir and A. Avizienis, "Fault-tolerant design for VLSI: Effect of interconnect requirements on yield improvement of VLSI designs," IEEE Trans. Comput., vol. C-31, pp. 609-615, July 1982.
[7] J. R. Sklaroff, "Redundancy management technique for the space shuttle computers," IBM J. Res. Development, pp. 20-27, Jan. 1976.
[8] B. W. Johnson and P. M. Julich, "Fault tolerant computer system for the A129 helicopter," IEEE Trans. Aerosp. Electron. Syst., vol. AES-21, pp. 220-229, Mar. 1985.
[9] -, "Reliability analysis of the A129 integrated multiplex system," in Proc. Nat. Aerosp. Electron. Conf., Dayton, OH, May 1984, pp. 1229-1236.
[10] D. P. Siewiorek and R. Swarz, The Theory and Practice of Reliable System Design. Bedford, MA: Digital Press, 1982.
[11] C. S. Droste and J. E. Walker, "The General Dynamics case study on the F-16 fly-by-wire flight control system," AIAA Professional Study Series, Amer. Instit. Aeronautics and Astronautics.
[12] K. Driscoll, "Multi-microprocessor flight control system, 1982," in Proc. Fifth Annu. Digital Avionics Conf., Seattle, WA, Nov. 1983.

Barry W. Johnson (S'78-M'80-S'80-M'82) received the B.S., M.E., and Ph.D. degrees in electrical engineering from the University of Virginia, Charlottesville, in 1979, 1980, and 1983, respectively.

He is currently an Assistant Professor in the Department of Electrical Engineering at the University of Virginia and is a member of the Department's Center for Semicustom Integrated Systems. Prior to joining the University, he was with Harris Corporation, Melbourne, FL where he participated in the design and analysis of fault-tolerant aerospace systems. His research interests include fault-tolerant computing, VLSI architectures, VLSI testing, and microcomputer control. He has published approximately 25 papers in the above fields and is currently preparing a text on fault-tolerant computing.

Dr. Johnson is a member of the IEEE Computer Society, Tau Beta Pi, Eta Kappa Nu, and Sigma Xi. He currently serves as the Chair of Finance for the Computer Society's Conferences and Tutorials Board and is a member of the Editorial Board of IEEE Micro.