
NUREG/CR-6101
UCRL-ID-114839

Software Reliability and Safety in Nuclear Reactor Protection Systems

Prepared by
J. D. Lawrence

Lawrence Livermore National Laboratory

Prepared for
U.S. Nuclear Regulatory Commission

AVAILABILITY NOTICE

Availability of Reference Materials Cited in NRC Publications

Most documents cited in NRC publications will be available from one of the following sources:

1. The NRC Public Document Room, 2120 L Street, NW, Lower Level, Washington, DC 20555-0001

2. The Superintendent of Documents, U.S. Government Printing Office, Mail Stop SSOP, Washington, DC 20402-9328

3. The National Technical Information Service, Springfield, VA 22161

Although the listing that follows represents the majority of documents cited in NRC publications, it is not intended to be exhaustive.

Referenced documents available for inspection and copying for a fee from the NRC Public Document Room include NRC correspondence and internal NRC memoranda; NRC Office of Inspection and Enforcement bulletins, circulars, information notices, inspection and investigation notices; Licensee Event Reports; vendor reports and correspondence; Commission papers; and applicant and licensee documents and correspondence.

The following documents in the NUREG series are available for purchase from the GPO Sales Program: formal NRC staff and contractor reports, NRC-sponsored conference proceedings, and NRC booklets and brochures. Also available are Regulatory Guides, NRC regulations in the Code of Federal Regulations, and Nuclear Regulatory Commission Issuances.

Documents available from the National Technical Information Service include NUREG series reports and technical reports prepared by other federal agencies and reports prepared by the Atomic Energy Commission, forerunner agency to the Nuclear Regulatory Commission.

Documents available from public and special technical libraries include all open literature items, such as books, journal and periodical articles, and transactions. Federal Register notices, federal and state legislation, and congressional reports can usually be obtained from these libraries.

Documents such as theses, dissertations, foreign reports and translations, and non-NRC conference proceedings are available for purchase from the organization sponsoring the publication cited.

Single copies of NRC draft reports are available free, to the extent of supply, upon written request to the Office of Information Resources Management, Distribution Section, U.S. Nuclear Regulatory Commission, Washington, DC 20555-0001.

Copies of industry codes and standards used in a substantive manner in the NRC regulatory process are maintained at the NRC Library, 7920 Norfolk Avenue, Bethesda, Maryland, and are available there for reference use by the public. Codes and standards are usually copyrighted and may be purchased from the originating organization or, if they are American National Standards, from the American National Standards Institute, 1430 Broadway, New York, NY 10018.

DISCLAIMER NOTICE

This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, or any of their employees, makes any warranty, expressed or implied, or assumes any legal liability or responsibility for any third party's use, or the results of such use, of any information, apparatus, product or process disclosed in this report, or represents that its use by such third party would not infringe privately owned rights.

NUREG/CR-6101
UCRL-ID-114839

Software Reliability and Safety in Nuclear Reactor Protection Systems

Manuscript Completed: June 1993
Date Published: November 1993

Prepared by
J. D. Lawrence

Lawrence Livermore National Laboratory
Livermore, CA 94550

Prepared for
Division of Reactor Controls and Human Factors
Office of Nuclear Reactor Regulation
U.S. Nuclear Regulatory Commission
Washington, DC 20555-0001
NRC FIN L1867

ABSTRACT

Planning the development, use and regulation of computer systems in nuclear reactor protection systems in such a way as to enhance reliability and safety is a complex issue. This report is one of a series of reports from the Computer Safety and Reliability Group, Lawrence Livermore National Laboratory, that investigates different aspects of computer software in reactor protection systems. There are two central themes in the report. First, software considerations cannot be fully understood in isolation from computer hardware and application considerations. Second, the process of engineering reliability and safety into a computer system requires activities to be carried out throughout the software life cycle. The report discusses the many activities that can be carried out during the software life cycle to improve the safety and reliability of the resulting product. The viewpoint is primarily that of the assessor, or auditor.


CONTENTS

1. Introduction
   1.1. Purpose
   1.2. Scope
   1.3. Report Organization

2. Terminology
   2.1. Systems Terminology
   2.2. Software Reliability and Safety Terminology
        2.2.1. Faults, Errors, and Failures
        2.2.2. Reliability and Safety Measures
        2.2.3. Safety Terminology
   2.3. Life Cycle Models
        2.3.1. Waterfall Model
        2.3.2. Phased Implementation Model
        2.3.3. Spiral Model
   2.4. Fault and Failure Classification Schemes
        2.4.1. Fault Classifications
        2.4.2. Failure Classifications
   2.5. Software Qualities

3. Life Cycle Software Reliability and Safety Activities
   3.1. Planning Activities
        3.1.1. Software Project Management Plan
        3.1.2. Software Quality Assurance Plan
        3.1.3. Software Configuration Management Plan
        3.1.4. Software Verification and Validation Plan
        3.1.5. Software Safety Plan
        3.1.6. Software Development Plan
        3.1.7. Software Integration Plan
        3.1.8. Software Installation Plan
        3.1.9. Software Maintenance Plan
        3.1.10. Software Training Plan
   3.2. Requirements Activities
        3.2.1. Software Requirements Specification
        3.2.2. Requirements Safety Analysis
   3.3. Design Activities
        3.3.1. Hardware and Software Architecture
        3.3.2. Software Design Specification
        3.3.3. Software Design Safety Analysis
   3.4. Implementation Activities
        3.4.1. Code Safety Analysis
   3.5. Integration Activities
        3.5.1. System Build Documents
        3.5.2. Integration Safety Analysis
   3.6. Validation Activities
        3.6.1. Validation Safety Analysis
   3.7. Installation Activities
        3.7.1. Operations Manual
        3.7.2. Installation Configuration Tables
        3.7.3. Training Manuals
        3.7.4. Maintenance Manuals
        3.7.5. Installation Safety Analysis
   3.8. Operations and Maintenance Activities-Change Safety Analysis

4. Recommendations, Guidelines, and Assessment
   4.1. Planning Activities
        4.1.1. Software Project Management Plan
        4.1.2. Software Quality Assurance Plan
        4.1.3. Software Configuration Management Plan
        4.1.4. Software Verification and Validation Plan
        4.1.5. Software Safety Plan
        4.1.6. Software Development Plan
        4.1.7. Software Integration Plan
        4.1.8. Software Installation Plan
        4.1.9. Software Maintenance Plan
   4.2. Requirements Activities
        4.2.1. Software Requirements Specification
        4.2.2. Requirements Safety Analysis
   4.3. Design Activities
        4.3.1. Hardware/Software Architecture Specification
        4.3.2. Software Design Specification
        4.3.3. Design Safety Analysis
   4.4. Implementation Activities
        4.4.1. Code Listings
        4.4.2. Code Safety Analysis
   4.5. Integration Activities
        4.5.1. System Build Documents
        4.5.2. Integration Safety Analysis
   4.6. Validation Activities
        4.6.1. Validation Safety Analysis
   4.7. Installation Activities
        4.7.1. Installation Safety Analysis

Appendix: Technical Background
   A.1. Software Fault Tolerance Techniques
        A.1.1. Fault Tolerance and Redundancy
        A.1.2. General Aspects of Recovery
        A.1.3. Software Fault Tolerance Techniques
   A.2. Reliability and Safety Analysis and Modeling Techniques
        A.2.1. Reliability Block Diagrams
        A.2.2. Fault Tree Analysis
        A.2.3. Event Tree Analysis
        A.2.4. Failure Modes and Effects Analysis
        A.2.5. Markov Models
        A.2.6. Petri Net Models
   A.3. Reliability Growth Models
        A.3.1. Duane Model
        A.3.2. Musa Model
        A.3.3. Littlewood Model
        A.3.4. Musa-Okumoto Model

References
   Standards
   Books, Articles, and Reports

Bibliography

Figures

Figure 2-1. Documents Produced During Each Life Cycle Stage
Figure 2-2. Waterfall Life Cycle Model
Figure 2-3. Spiral Life Cycle Model
Figure 3-1. Software Planning Activities
Figure 3-2. Outline of a Software Project Management Plan
Figure 3-3. Outline of a Software Quality Assurance Plan
Figure 3-4. Outline of a Software Configuration Management Plan
Figure 3-5. Verification and Validation Activities
Figure 3-6. Outline of a Software Verification and Validation Plan
Figure 3-7. Outline of a Software Safety Plan
Figure 3-8. Outline of a Software Development Plan
Figure 3-9. Outline of a Software Integration Plan
Figure 3-10. Outline of a Software Installation Plan
Figure 3-11. Outline of a Software Maintenance Plan
Figure 3-12. Outline of a Software Requirements Plan
Figure A-1. Reliability Block Diagram of a Simple System
Figure A-2. Reliability Block Diagram of Single, Duplex, and Triplex Communication Line
Figure A-3. Reliability Block Diagram of Simple System with Duplexed Communication Line
Figure A-4. Reliability Block Diagram that Cannot Be Constructed from Serial and Parallel Parts
Figure A-5. Simple Fault Tree
Figure A-6. AND Node Evaluation in a Fault Tree
Figure A-7. OR Node Evaluation in a Fault Tree
Figure A-8. Example of a Software Fault Tree
Figure A-9. Simple Event Tree
Figure A-10. A Simple Markov Model of a System with Three CPUs
Figure A-11. Markov Model of a System with CPUs and Memories
Figure A-12. Simple Markov Model with Varying Failure Rates
Figure A-13. Markov Model of a Simple System with Transient Faults
Figure A-14. An Unmarked Petri Net
Figure A-15. Example of a Marked Petri Net
Figure A-16. The Result of Firing Figure A-15
Figure A-17. A Petri Net for the Mutual Exclusion Problem
Figure A-18. Petri Net for a Railroad Crossing
Figure A-19. Execution Time Between Successive Failures of an Actual System

Tables

Table 2-1. Persistence Classes and Fault Sources
Table A-1. Failure Rate Calculation

ABBREVIATIONS AND ACRONYMS

ANSI   American National Standards Institute
CASE   Computer-Assisted Software Engineering
CCB    Configuration Control Board
CI     Configuration Item
CM     Configuration Management
CPU    Central Processing Unit
ETA    Event Tree Analysis
FBD    Functional Block Diagram
FLBS   Functional Level Breakdown Structure
FMEA   Failure Modes and Effects Analysis
FMECA  Failure Modes, Effects and Criticality Analysis
FTA    Fault Tree Analysis
I&C    Instrumentation and Control
I/O    Input/Output
IEEE   Institute of Electrical and Electronics Engineers
MTTF   Mean Time To Failure
PDP    Previously Developed or Purchased
PERT   Program Evaluation and Review Technique
QA     Quality Assurance
RAM    Random Access Memory
ROM    Read Only Memory
SCM    Software Configuration Management
SCMP   Software Configuration Management Plan
SPMP   Software Project Management Plan
SQA    Software Quality Assurance
SQAP   Software Quality Assurance Plan
SRS    Software Requirements Specification
SSP    Software Safety Plan
TMR    Triple Modular Redundancy
UCLA   University of California at Los Angeles
UPS    Uninterruptible Power Supply
V&V    Verification and Validation
WBS    Work Breakdown Structure

EXECUTIVE SUMMARY

The development, use, and regulation of computer systems in nuclear reactor protection systems to enhance reliability and safety is a complex issue. This report is one of a series of reports from the Computer Safety and Reliability Group, Lawrence Livermore National Laboratory, which investigates different aspects of computer software in reactor protection systems.

There are two central themes in this report. First, software considerations cannot be fully understood in isolation from computer hardware and application considerations. Second, the process of engineering reliability and safety into a computer system requires activities to be carried out throughout the software life cycle. These two themes affect both the structure and the content of this report.

Reliability and safety are concerned with faults, errors, and failures. A fault is a triggering event that causes things to go wrong; a software bug is an example. The fault may cause a change of state in the computer, which is termed an error. The error remains latent until the incorrect state is used; it then is termed effective. It may then cause an externally-visible failure. Only the failure is visible outside the computer system. Preventing or correcting the failure can be done at any of the levels: preventing or correcting the causative fault, preventing the fault from causing an error, preventing the error from causing a failure, or preventing the failure from causing damage. The techniques for achieving these goals are termed fault prevention, fault correction, and fault tolerance.

Reliability and safety are related, but not identical, concepts. Reliability, as defined in this report, is a measure of how long a system will run without failure of any kind, while safety is a measure of how long a system will run without catastrophic failure. Thus safety is directly concerned with the consequences of failure, not merely the existence of failure. As a result, safety is a system issue, not simply a software issue, and must be analyzed and discussed as a property of the entire reactor protection system.

Faults and failures can be classified in several different ways. Faults can be described as design faults, operational faults, or transient faults. All software faults are design faults; however, hardware faults may occur in any of the three classes. This is important in a safety-related system since the software may be required to compensate for the operational faults of the hardware. Faults can also be classified by the source of the fault; software and hardware are two of the possible sources discussed in the report. Others are: input data, system state, system topology, people, environment, and unknown. For example, the source of many transient faults is unknown.

Failures are classified by mode and scope. A failure mode may be sudden or gradual; partial or complete. All four combinations of these are possible. The scope of a failure describes the extent within the system of the effects of the failure. This may range from an internal failure, whose effect is confined to a single small portion of the system, to a pervasive failure, which affects much of the system.

Many different life cycle models exist for developing software systems. These differ in the timing of the various activities that must be done in order to produce a high-quality software product, but the actual activities must be done in any case. No particular life cycle is recommended here, but there are extensive comments on the activities that must be carried out. These have been divided into eight categories, termed sets of activities in the report. These sets are used merely to group related activities; there is no implication that the activities in any one set must all be carried out at the same time, or that activities in "later" sets must follow those of "earlier" sets. The eight categories are as follows:

• Planning activities result in the creation of a number of documents that are used to control the development process. Eleven are recommended here: a Software Project Management Plan, a Software Quality Assurance Plan, a Software Configuration Management (CM) Plan, a Software Verification and Validation (V&V) Plan, a Software Safety Plan, a Software Development Plan, a Software Integration Plan, a Software Installation Plan, a Software Maintenance Plan, a Software Training Plan, and a Software Operations Plan. Many of these plans are discussed in detail, relying on various ANSI/IEEE standards when these exist for the individual plans.

• The second set of activities relates to documenting the requirements for the software system. Four documents are recommended: the Software Requirements Specification, a Requirements Safety Analysis, a V&V Requirements Analysis, and a CM Requirements Report. These documents will fully capture all the requirements of the software project, and relate these requirements to the overall protection system functional requirements and protection system safety requirements.

• The design activities include five recommended documents. The Hardware and Software Architecture will describe the computer system design at a fairly high level, giving hardware devices and mapping software activities to those devices. The Software Design Specification provides the complete design of the software products. Design analyses include the Design Safety Analysis, the V&V Design Analysis, and the CM Design Report.

• Implementation activities include writing and analyzing the actual code, using some programming language. Documents include the actual code listings, the Code Safety Analysis, the V&V Implementation Analysis and Test Report, and the CM Implementation Report.

• Integration activities are those activities that bring software, hardware, and instrumentation together to form a complete computer system. Documents include the System Build Documents, the Integration Safety Analysis, the V&V Integration Analysis and Test Report, and the CM Integration Report.

• Validation is the process of ensuring that the final complete computer system achieves the original goals that were imposed by the protection system design. The final system is matched against the original requirements and the protection system safety analysis. Documents include the Validation Safety Analysis, the V&V Validation and Test Report, and the CM Validation Report.

• Installation is the process of moving the completed computer system from the developer's site to the operational environment, within the actual reactor protection system. The completion of installation provides the operator with a documented operational computer system. Seven documents are recommended: the Operations Manual, the Installation Configuration Tables, Training Manuals, Maintenance Manuals, an Installation Safety Analysis, a V&V Installation Analysis and Test Report, and a CM Installation Report.

• The operations and maintenance activities involve the actual use of the computer system in the operating reactor, and making any required changes to it. Changes may be required due to errors in the system that were not found during the development process, changes to hardware, or requirements for additional functionality. Safety analyses, V&V analyses, and CM activities are all recommended as part of the maintenance process.

Three general methods exist that may be used to achieve software fault tolerance: n-version programming, recovery block, and exception handling. Each of these attempts to achieve fault tolerance by using more than one algorithm or program module to perform a calculation, with some means of selecting the preferred result. In n-version programming, three or more program modules that implement the same function are executed in parallel, and voting is used to select the "correct" one. In recovery block, two or more modules are executed in series, with an acceptance algorithm used after each module is executed to decide if the result should be accepted or the next module executed. In exception handling, a single module is executed, with corrections made when exceptions are detected. Serious questions exist as to the applicability of the n-version programming and recovery-block techniques to reactor protection systems, because of the assumptions underlying the techniques, the possibility of common-mode failures in the voting or decision programs, and the cost and time of implementing them.
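The selection logic of the first two techniques is easy to blur in prose, so a minimal sketch may help. It is written in Python for brevity; the version lists, the voter, and the acceptance test are hypothetical illustrations, not constructs taken from this report.

```python
def n_version(inputs, versions):
    """N-version programming: run independently developed versions of
    the same function in parallel and let a voter pick the majority result."""
    results = [version(inputs) for version in versions]
    for candidate in results:
        if results.count(candidate) > len(versions) // 2:
            return candidate
    raise RuntimeError("voter found no majority result")

def recovery_block(inputs, alternates, acceptable):
    """Recovery block: execute alternates in series, returning the first
    result that passes the acceptance test."""
    for alternate in alternates:
        result = alternate(inputs)
        if acceptable(result):
            return result
    raise RuntimeError("all alternates were rejected by the acceptance test")
```

Note that both sketches concentrate trust in the voter or acceptance test itself, which is precisely the common-mode concern raised above.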

One means of assessing system reliability or safety is to create a mathematical model of the system and analyze the properties of that model. This can be very effective provided that the model captures all the relevant factors of the reality. Reliability models have been used for many years for electronic and mechanical systems. The use of reliability models for software is fairly new, and their effectiveness has not yet been fully demonstrated. Fault tree models, event tree models, failure modes and effects analysis, Markov models, and Petri net models all have possibilities. Of particular interest are reliability growth models, since software bugs tend to be corrected as they are found. Reliability growth models can be very useful in understanding the growth of reliability through a testing activity, but cannot be used alone to justify software for use in a safety-related application, since such applications require a much higher level of reliability than can be convincingly demonstrated during a test-correct-test activity.
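As a concrete instance, the Duane model (one of the growth models discussed in the report's appendix) expresses reliability growth as a power law; the notation below is the common textbook form rather than a quotation from the report. With β < 1, the failure intensity λ(t) declines as testing and correction proceed:

```latex
N(t) = \alpha t^{\beta}, \qquad
\lambda(t) = \frac{dN}{dt} = \alpha \beta\, t^{\beta - 1}, \qquad 0 < \beta < 1 .
```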


Software Reliability and Safety in Nuclear Reactor Protection Systems

    1. INTRODUCTION

    1.1. Purpose

Reliability and safety are related, but not identical, concepts. Reliability can be thought of as the probability that a system survives without failure of any kind, while safety is concerned with the consequences of failure. Both are important in reactor protection systems. When a protection system is controlled by a computer, the impact of the computer system on reliability and safety must be considered in the reactor design. Because software is an integral part of a computer system, software reliability and software safety become a matter of concern to the organizations that develop software for protection systems and to the government agencies that regulate the developers. This report is oriented toward the assessment process. The viewpoint is from that of a person who is assessing the reliability and safety of a computer software system that is intended to be used in a reactor protection system.

1.2. Scope

Software is only one portion of a computer system. The other portions are the computer hardware and the instrumentation (sensors and actuators) to which the computer is connected. The combination of software, hardware, and instrumentation is frequently referred to as the Instrumentation and Control (I&C) System. Nuclear reactors have at least two I&C systems: one controls the reactor operation, and the other controls the reactor protection. The latter, termed the Protection Computer System, is the subject of this report.

This report assumes that the computer system as a whole, as well as the hardware and instrumentation subsystems, will be subject to careful development, analysis, and assessment in a manner similar to that given here for the software. That is, it is assumed that there will be appropriate plans, requirements and design specifications, procurement and installation, testing and analysis for the complete computer system, as well as the hardware, software, and instrumentation subsystems. The complete computer system and the hardware and instrumentation subsystems are discussed here only as they relate to the software subsystem.

The report is specifically directed toward enhancing the reliability and safety of computer-controlled reactor protection systems. Almost anything can affect safety, so it is difficult to bound the contents of the report. Consequently, material is included that may seem tangential to the topic. In these cases the focus is on reliability and safety; other aspects of such material are summarized or ignored. More complete discussions of these secondary issues may be found in the references.

This report is one of a series of reports prepared by the Computer Safety and Reliability Group, Fission Energy and System Safety Program, Lawrence Livermore National Laboratory. Aspects of software reliability and safety engineering that are covered in the other reports are treated briefly in this report, if at all. The reader is referred to the following additional reports:

1. Robert Barter and Lin Zucconi, "Verification and Validation Techniques and Auditing Criteria for Critical System-Control Software," Lawrence Livermore National Laboratory, Livermore, CA (February 1993).

2. George G. Preckshot, "Real-Time Systems Complexity and Scalability," Lawrence Livermore National Laboratory, Livermore, CA (August 1992).



3. George G. Preckshot and Robert H. Wyman, "Communications Systems in Nuclear Power Plants," Lawrence Livermore National Laboratory, Livermore, CA (August 1992).

4. George G. Preckshot, "Real-Time Performance," Lawrence Livermore National Laboratory, Livermore, CA (November 1992).

5. Debra Sparkman, "Techniques, Processes, and Measures for Software Safety and Reliability," Lawrence Livermore National Laboratory, Livermore, CA (April 1992).

6. Lloyd G. Williams, "Formal Methods in the Development of Safety Critical Software Systems," SERM-014-91, Software Engineering Research, Boulder, CO (April 1992).

7. Lloyd G. Williams, "Assessment of Formal Specifications for Safety-Critical Systems," Software Engineering Research, Boulder, CO (February 1993).

8. Lloyd G. Williams, "Considerations for the Use of Formal Methods in Software-Based Safety Systems," Software Engineering Research, Boulder, CO (February 1993).

9. Lin Zucconi and Booker Thomas, "Testing Existing Software for Safety-Related Applications," Lawrence Livermore National Laboratory, Livermore, CA (January 1993).

1.3. Report Organization

Section 2 contains background on several topics relating to software reliability and software safety. Terms are defined, life cycle models are discussed briefly, and two classification schemes are presented.

Section 3 provides detail on the many life cycle activities that can be done to improve reliability and safety. Development activities are divided into eight sets of activities: planning, requirements specification, design specification, software implementation, integration with hardware and instrumentation, validation, installation and operations, and maintenance. Each set of activities includes a number of tasks that can be undertaken to enhance reliability and safety. Because the report is oriented towards assessment, the tasks are discussed in terms of the documents they produce and the actions necessary to create the document contents.

Section 4 discusses specific motivations, recommendations, guidelines, and assessment questions. The motivation sections describe particular concerns of the assessor when examining the safety of software in a reactor protection system. Recommendations consist of actions the developer should or should not do in order to address such concerns. Guidelines consist of suggestions that are considered good engineering practice when developing software. Finally, the assessment sections consist of lists of questions that the assessor may use to guide the assessment of a particular aspect of the software system.

From the viewpoint of the assessor, software development consists of the organization that does the development, the process used in the development, and the products of that development. Each is subject to analysis, assessment, and judgment. This report discusses all three aspects in various places within the framework of the life cycle. Process and product are the primary emphasis.

Following the main body of the report, the appendix provides information on software fault tolerance techniques and software reliability models. A bibliography of information relating to software reliability and safety is also included.



    2. TERMINOLOGY

This section includes discussions of the basic terminology used in the remainder of the report. The section begins with a description of the terms used to describe systems. Section 2.2 provides careful definitions of the basic terminology for reliability and safety. Section 2.3 contains brief descriptions of several of the life cycle models commonly used in software development, and defines the various activities that must be carried out during any software development project. Section 2.4 describes various classification schemes for failures and faults, and provides the terms used in these schemes. Finally, Section 2.5 discusses the terms used to describe software qualities that are used in following sections.

    2.1. Systems Terminology

The word system is used in many different ways in computer science. The basic definition, given in IEEE Standard 610.12, is "a collection of components organized to accomplish a specific function or set of functions." In the context of a nuclear reactor, the word could mean, depending on context, the society using the reactor, the entire reactor itself, the portion devoted to protection, the computer hardware and software responsible for protection, or just the software.

In this report the term system, without modifiers, will consistently refer to the complete application with which the computer is directly concerned. Thus a "system" should generally be understood as a "reactor protection system." When portions of the protection system are meant, and the meaning isn't clear from context, a modifier will be used. Reference could be made to the computer system (a portion of the protection system), the software system (in the computer system), the hardware system (in the computer system), and so forth. In some cases, the term "application system" is used to emphasize that the entire reactor protection system is meant.

A computer system is itself composed of subsystems. These include the computer hardware, the computer software, operators who are using the computer system, and the instruments to which the computer is connected. The definition of instrument is taken from ANSI/ISA Standard S5.1: "a device used directly or indirectly to measure and/or control a variable. The term includes primary elements, final control elements, computing devices and electrical devices such as annunciators, switches, and pushbuttons. The term does not apply to parts that are internal components of an instrument."

Since this report is concerned with computer systems in general, and software systems in particular, instruments are restricted to those that interact with the computer system. There are two types: sensors and actuators. Sensors provide information to the software on the state of the reactor, and actuators provide commands to the rest of the reactor protection system from the software.

2.2. Software Reliability and Safety Terminology

    2.2.1. Faults, Errors, and Failures

The words fault, error, and failure have a plethora of definitions in the literature. This report uses the following definitions, specialized to computer systems (Laprie 1985; Randell 1978; Siewiorek 1982).

A fault is a deviation of the behavior of a computer system from the authoritative specification of its behavior. A hardware fault is a physical change in hardware that causes the computer system to change its behavior in an undesirable way. A software fault is a mistake (also called a bug) in the code. A user fault consists of a mistake by a person in carrying out some procedure. An environmental fault is a deviation from expected behavior of the world outside the computer system; electric power interruption is an example. The classification of faults is discussed further in Subsection 2.4.1.

An error is an incorrect state of hardware, software, or data resulting from a fault. An error is, therefore, that part of the computer system state that is liable to lead to failure. Upon occurrence, a fault creates a latent error, which becomes effective when it is activated, leading to a failure. If never activated, the latent error never becomes effective and no failure occurs.

A failure is the external manifestation of an error. That is, a failure is the external effect of the error, as seen by a (human or physical device) user, or by another program.

Some examples may clarify the differences among the three terms. A fault may occur in a circuit (a wire breaks) causing a bit in memory to always be a 1 (an error, since memory is part of the state) resulting in a failed calculation.

A programmer's mistake is a fault; the consequence is a latent error in the written software (erroneous instruction). Upon activation of the module where the error resides, the error becomes effective. If this effective error causes a divide by zero, a failure occurs and the program aborts.

A maintenance or operating manual writer's mistake is a fault; the consequence is an error in the corresponding manual, which will remain latent as long as the directives are not acted upon.
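The chain from fault through latent and effective error to failure can be traced in a code fragment that mirrors the programmer's-mistake example above; the function and the seeded bug below are invented purely for illustration.

```python
def scale_reading(raw):
    divisor = 0            # fault: the programmer meant 10; this wrong
                           # instruction is the latent error in the code
    return raw / divisor   # the error becomes effective when this executes

try:
    scale_reading(42)      # activation of the module containing the error
except ZeroDivisionError:
    print("failure: the abort is the externally visible manifestation")
```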

The view summarized here enables fault pathology to be made precise. The creation and action mechanisms of faults, errors, and failures may be summarized as follows.

1. A fault creates one or more latent errors in the computer system component where it occurs. Physical faults can directly affect only the physical layer components, whereas other types of faults may affect any component.

2. There is always a time delay between the occurrence of a fault and the occurrence of the resulting latent error(s). This may be measured in nanoseconds or years, depending on the situation. Some faults may not cause errors at all; for example, a bug in a portion of a program that is never executed. It is convenient to consider this to be an extreme case in which an infinite amount of time elapses between fault and latent error.

3. The properties governing errors may be stated as follows:

a. A latent error becomes effective once it is activated.

b. An error may cycle between its latent and effective states.

c. An effective error may, and in general does, propagate from one component to another. By propagating, an error creates other (new) errors.

From these properties it may be deduced that an effective error within a component may originate from:

• Activation of a latent error within the same component.

• An effective error propagating within the same component or from another component.

4. A component failure occurs when an error affects the service delivered (as a response to requests) by the component. There is always a time delay between the occurrence of the error and the occurrence of the resulting failure. This may vary from nanoseconds to infinity (if the failure never actually occurs).

5. These properties apply to any component of the computer system. In a hierarchical system, failures at one level can usefully be thought of as faults by the next higher level.

Most reliability, availability, and safety analysis and modeling assume that each fault causes at most a single failure. That is, failures are statistically independent. This is not always true. A common-mode failure occurs when multiple components of a computer system fail due to a single fault. If common-mode failures do occur, an analysis that assumes that they do not will be excessively optimistic; a numerical illustration follows the list below. There are a number of reasons for common-mode failures (Dhillon 1983):

    " Environmental causes, such as dirt, temperature,moisture, and vibrations.

    " Equipment failure that results from an unexpectedexternal event, such as fire, flood, earthquake, ortornadoes.

    " Design deficiencies, where some failures were notanticipated during design. An example is multipletelephone circuits routed through a singleequipment box. Software design errors, whereidentical software is being run on multiplecomputers, is of particular concern in this report.

    " Operational errors, due to factors such as impropermaintenance procedures, carelessness, or impropercalibration of equipment.

    " Multiple items purchased from the same vendor,where all of the items have the samemanufacturing defect.

    " Common power supply used for redundant units.

    • Functional deficiencies, such as misunderstandingof process variable behavior, inadequatelydesigned protective actions, or inappropriateinstrumentation.
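The cost of wrongly assuming independence can be stated numerically; the following arithmetic is illustrative only and is not drawn from the report. Suppose two redundant channels each fail independently with probability p = 10⁻³ per demand, while a single common-mode fault disables both with probability q = 10⁻⁵:

```latex
P_{\text{indep}}(\text{both fail}) = p^2 = 10^{-6}, \qquad
P_{\text{actual}}(\text{both fail}) \approx p^2 + q \approx 1.1 \times 10^{-5} .
```

An analysis that omits the common-mode term would understate the probability of losing both channels by roughly a factor of ten.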

    2.2.2. Reliability and Safety Measures

Reliability and safety measurements are inherently statistical, so the fundamental quantities are defined statistically. The four basic terms are reliability, availability, maintainability, and safety. These and other related terms are defined in the following text. Note that the final three definitions are qualitative, not quantitative (Siewiorek 1982; Smith 1972). Most of these definitions apply to arbitrary systems. The exception is safety; since this concept is concerned with the consequences of failure, rather than the simple fact of failure, the definition applies only to a system that can have major impacts on people or equipment. More specifically, safety applies to reactors, not to components of a reactor.

• The reliability, R(t), of a system is the conditional probability that the system has survived the interval [0, t], given that it was operating at time 0. Reliability is often given in terms of the failure rate (also referred to as the hazard rate), λ(t), or the mean time to failure, mttf. If the failure rate is constant, mttf = 1/λ (the standard relationships are collected after these definitions). Reliability is a measure of the success with which the system conforms to some authoritative specification of its behavior, and cannot be measured without such a specification.

• The availability, A(t), of a system is the probability that the system is operational at the instant of time t. For nonrepairable systems, availability and reliability are equal. For repairable systems, they are not. As a general rule, 0 ≤ R(t) ≤ A(t) ≤ 1.
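For the constant-failure-rate case mentioned in the reliability definition, the standard relationships of reliability theory (consistent with, though not quoted from, this report) are:

```latex
R(t) = e^{-\lambda t}, \qquad
\mathit{mttf} = \int_0^{\infty} R(t)\, dt = \frac{1}{\lambda}, \qquad
0 \le R(t) \le A(t) \le 1 .
```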


It is also useful to consider the word "critical" when used to describe systems. A critical system is a system whose failure may have very unpleasant consequences (mishaps). The results of failure may affect the developers of the system, its direct users, their customers or the general public. The consequences may involve loss of life or property, financial loss, legal liability (such as jail), regulatory threats, or even the loss of good will (if that is extremely important). The term safety critical refers to a system whose failure could cause an accident.

A good brief discussion of accidents is found in Leveson 1991:

Despite the usual oversimplification of the causes of particular accidents ("human error" is often the identified culprit despite the all-encompassing nature and relative uselessness of such a categorization), accidents are caused almost without exception by multiple factors, and the relative contribution of each is usually not clear. An accident may be thought of as a set of events combining together in random fashion or, alternatively, as a dynamic mechanism that begins with the activation of a hazard and flows through the system as a series of sequential and concurrent events in a logical sequence until the system is out of control and a loss is produced (the "domino theory"). Either way, major incidents often have more than one single cause, and it is usually difficult to place blame on any one event or component of the system. The high frequency of complex, multifactorial accidents may arise from the fact that the simpler potentials have been anticipated and handled. But the very complexity of events leading to an accident implies that there may be many opportunities to intervene or interrupt the sequence.

A second characteristic of accidents is that they often involve problems in subsystem interfaces. It appears to be easier to deal with failures of components than failures in the interfaces between components. This should not be a surprise to software engineers; consider the large number of operational software faults that can be traced back to requirements problems. The software requirements are the specific representation of the interface between the software and the processes or devices being controlled.

A third important characteristic claimed for accidents is that they are intimately intertwined with complexity and coupling. Perrow has argued that accidents are "normal" in complex and tightly coupled systems. Unless great care is taken, the addition of computers to control these systems is likely to increase both complexity and coupling, which will increase the potential for accidents.

    2.3. Life Cycle Models

Many different software life cycles have been proposed. These have different motivations, strengths, and weaknesses. The life cycle models generally require the same types of tasks to be carried out; they differ in the ordering of these tasks in time. No particular life cycle is assumed here. There is an assumption that the activities that occur during the developer's life cycle yield the products indicated in Figure 2-1. Each of the life cycle activities produces one or more products, mostly documents, that can be assessed. The development process itself is subject to assessment.

The ultimate result of software development, as considered in this report, is a suite of computer programs that run on computers and control the reactor protection system. These programs will have characteristics deemed desirable by the developer or customer, such as reliability, performance, usability, and functionality. This report is only concerned with reliability and safety; however, that concern does "spill over" into other qualities.

The development model used here suggests one or more audits of the products of each set of life cycle activities. The number of audits depends, among other things, on the specific life cycle model used by the developer. The audit will assess the work done that relates to the set of activities being audited. Many reliability, performance, and safety problems can be resolved only by careful design of the software product, so they must be addressed early in the life cycle, no matter which life cycle is used. Any errors or oversights can require difficult and expensive retrofits, so they are best found as early as possible. Consequently, an incremental audit process is believed to be more cost effective than a single audit at the end of the development process. In this way, problems can be detected early in the life cycle and corrected before large amounts of resources have been wasted.

Three of the many life cycle models are described briefly in Subsections 2.3.1 through 2.3.3. No particular life cycle model is advocated. Instead, a model should be chosen to fit the style of the development organization and the nature of the problem being solved.

    2.3.1. Waterfall Model

The classic waterfall model of software development assumes that each phase of the life cycle can be completed before the next phase is begun (Pressman 1987). This is illustrated in Figure 2-2. The actual phases of the waterfall model differ among the various authors who discuss the model; the figure shows phases appropriate to reactor protection systems. Note that the model permits the developer to return to previous phases. However, this is considered to be an exceptional condition to the normal forward flow, included to permit errors in previous stages to be corrected. For example, if a requirements error is discovered during the implementation phase, the developer is expected to halt work, return to the requirements phase, fix the problem, change the design accordingly, and then restart the implementation from the revised design. In practice, one only stops the implementation affected by the newly discovered requirement.

The waterfall model has been severely criticized as not being realistic to many software development situations, and this criticism is frequently justified. It remains an excellent model for those situations where the requirements are known and stable before development begins, and where little change to requirements is anticipated.

    2.3.2. Phased Implementation Model

This model assumes that the development will take place as a sequence of versions, with a release after each version is completed. Each version has its own life cycle model. If new requirements are generated during the development of a version, they will generally be delayed until the next version, so a waterfall model may be appropriate to each version. (Marketing pressures may modify such delays.)

This model is appropriate to commercial products that are evolving over long periods of time, or for which external requirements change slowly. Operating systems and language compilers are examples.

    2.3.3. Spiral Model

The spiral model was developed at TRW (Boehm 1988) in an attempt to solve some of the perceived difficulties with earlier models. This model assumes that software development can be modeled as a sequence of activities, as shown in Figure 2-3. Each time around the spiral (phase), the product is developed to a more complete degree. Four broad steps are required:

1. Determine the objectives for the phase. Consider alternatives to meeting the objectives.

2. Evaluate the alternatives. Identify risks to completing the phase, and perform a risk analysis. Make a decision to proceed or stop.

    3. Develop the product for the particular phase.

    4. Plan for the next phase.

The products for each phase may match those of the previous models. In such circumstances, the first loop around the spiral results in a concept of operations; the next, a requirements specification; the next, a design; and so forth. Alternately, each loop may contain a complete development cycle for one phase of the product; here, the spiral model looks somewhat like the phased implementation model. Other possibilities exist.

The spiral model is particularly appropriate when considerable financial, schedule, or technical risk is involved in the product development. This is because an explicit risk analysis is carried out as part of each phase, with an explicit decision to continue or stop.

2.4. Fault and Failure Classification Schemes

Faults and failures can be classified in several different ways. Those that are considered useful in safety-related applications are described briefly here. Faults are classified by persistence and by the source of the fault. There is some interaction between these, in the sense that not all persistence classes may occur for all sources. Table 2-1 provides the interrelationship.

Failures are classified by mode, scope, and the effect on safety. These classification schemes consider the effect of a failure, both on the environment within which the computer system operates, and on the components of the system.


[Figure 2-1 (not reproduced): software developer activities (planning, requirements, design, and implementation activities) shown in parallel with software audit activities.]

Figure 2-1. Documents Produced During Each Life Cycle Stage


[Figure 2-1, continued (not reproduced): integration, validation, installation, and operation and maintenance activities, shown in parallel with software audit activities.]

Figure 2-1. Documents Produced During Each Life Cycle Stage (continued)


    Figure 2-2. Waterfall Life Cycle Model


[Figure 2-3 (not reproduced): the spiral life cycle model. Cumulative cost grows along successive loops, and each loop passes through four quadrants: determine objectives, alternatives, and constraints; evaluate alternatives and identify and resolve risks (supported by prototypes and by simulations, models, and benchmarks); develop and verify the next-level product (requirements, design, code, integration and test, and acceptance test); and plan the next phases.]

Figure 2-3. Spiral Life Cycle Model (Boehm 1988)


Table 2-1. Persistence Classes and Fault Sources

Fault source          Design    Operational    Transient
Hardware component      X           X              X
Software component      X
Input data              X           X
Permanent state         X           X
Temporary state         X           X
Topological             X
Operator                X           X              X
User                    X           X              X
Environmental           X           X              X
Unknown                                            X

    2.4.1. Fault Classifications

Faults and failures can be classified by several more-or-less orthogonal measures. This is important, because the classification may affect the depth and method of analysis and problem resolution, as well as the preferred modeling technique.

Faults can be classified by the persistence and source of the fault. These classifications are described in the two subsections that follow. Terms defined in each subsection are used in other subsections.

    2.4.1.1. Fault Persistence

Any fault falls into one of the following three classes (Kopetz 1985):

• A design fault is a fault that can be corrected by redesign. Most software and topological faults fall into this class, but relatively few hardware faults do. Design faults are sometimes called removable faults, and are generally modeled by reliability growth models (see Appendix A.3). One design fault can cause many errors and failures before it is diagnosed and corrected. Design faults are usually quite expensive to correct if they are not discovered until the product is in operation.

• An operational fault is a fault where some portion of the computer system breaks and must be repaired in order to return the system to a state that meets the design specifications. Examples include electronic and mechanical faults, database corruption, and some operator faults. Operational faults are sometimes called non-removable faults. When calculating fault rates for operational faults, it is generally assumed that the entity that has failed is in the steady-state portion of its life, so operational fault rates are constant. As with design faults, an operational fault may cause many errors before being identified and repaired.

    " A transient fault is a fault that does cause acomputer system failure, but is no longer presentwhen the system is restarted. Frequently the basiccause of a transient fault cannot be determined.Redesign or repair has no effect in this case,although redesign can affect the frequency oftransient faults. Examples include power supplynoise and operating system timing errors. While anunderlying problem may actually exist, no actionis taken to correct it (or the fault would fall into

    NUREG/CR-6101 12

  • Section 2. Terminology

    one of the other classes). In some computersystems, 50-80% of all faults are transient. Thefrequency of operating system faults, for example,is typically dependent on system load andcomposition.

The class of transient faults actually includes two different types of event; they are grouped together here since it is generally impossible to distinguish between them. Some events are truly transient; a classic (though speculative) example is a cosmic ray that flips a single memory bit. The other type is an event that really is a design or operational fault, but this is not known when it occurs. That is, it looks like the first type of transient event. If the cause is never discovered, no real harm is done in placing it in this class. However, if the cause is eventually determined, the event should be classified properly; this may well require recalculation of reliability measures.

A computer system is constructed according to some specification. If the system fails, but still meets the specification, then the specification was wrong. This is a design fault. If, however, the system ceases to meet the specification and fails, then the underlying fault is an operational fault. A broken wire is an example. If the specification is correct, but the system fails momentarily and then recovers on its own, the fault is transient.

Many electronic systems, and some mechanical systems, have a three-stage life cycle with respect to fault persistence. When the device is first constructed, it will have a fairly high fault rate due to undetected design faults and "burn-in" operational faults. This fault rate decreases for a period of time, after which the device enters its normal life period. During this (hopefully quite long) period, the failure rate is approximately constant, and is due primarily to operational and transient faults, with perhaps a few remaining design faults. Eventually the device begins to wear out, and enters the terminal stage of its life. Here the fault rate increases rapidly as the probability of an operational fault goes up at an increasing rate. It should be noted that in many cases the end of the product's useful life is defined by this increase in the fault rate.

The behavior described in the last paragraph results in a failure rate curve termed the "bathtub" curve. It was originally designed to model electronic failure rates. There is a somewhat analogous situation for software. When a software product is first released, there may be many failures in the field for some period of time. As the underlying faults are corrected and new releases are sent to the customers, the failure rate should decrease until a more-or-less steady state is reached. Over time, the maintenance and enhancement process may perturb the software structure sufficiently that new faults are introduced faster than old ones are removed. The failure rate may then go up, and a complete redesign is in order.

While this behavior looks similar to that described for electronic systems, the causal factors are quite different. One should be very careful when attempting to extrapolate from one to the other.
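As a hedged aside drawn from standard reliability engineering (not part of this report), the hardware bathtub curve is often approximated by a Weibull hazard function, whose shape parameter selects the life-cycle stage described above:

    h(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1},
    \qquad
    R(t) = e^{-(t/\eta)^{\beta}}

Here \eta is a scale parameter; \beta < 1 gives the decreasing burn-in rate, \beta > 1 the increasing wear-out rate, and \beta = 1 the constant rate of the normal life period, in which case R(t) reduces to the exponential survival law e^{-\lambda t} with \lambda = 1/\eta, consistent with the constant operational fault rates assumed earlier.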

    2.4.1.2. Source of Faults in Computer Systems

Fault sources can be classified into a number of categories; ten are given here. For each one, the source is described briefly, and the types of persistence that are possible are discussed.

• A hardware fault is a fault in a hardware component, and can be of any of the three persistence types. Application systems rarely encounter hardware design faults. Transient hardware faults are very frequent in some systems.

• A software fault is a bug in a program. In theory, all such are design faults. Dhillon (1987) classifies software faults into the following eight categories:

- Logic faults
- Interface faults
- Data definition faults
- Database faults
- Input/output faults
- Computational faults
- Data handling faults
- Miscellaneous faults

    " An input data fault is a mistake in the input. Itcould be a design fault (connecting a sensor to thewrong device is an example) or an operationalfault (if a user supplies the wrong data).

    " A permanent state fault is a fault in state data thatis recorded on non-volatile storage media (such asdisk). Both design and operational faults arepossible. The use of a data structure definition thatdoes not accurately reflect the relationships amongthe data items is an example of a design fault. Thefailure of a program might cause an erroneousvalue to be stored in a file, causing an operationalfault in the file.


    " A temporary state fault is a fault in state data thatis recorded on volatile media (such as mainmemory). Both design and operational faults arepossible. The primary reason to separate this frompermanent state faults is to allow for thepossibility of different failure rates.

    " A topologicalfault is a fault caused by a mistakein computer system architecture, not with thecomponent parts. All such faults are design faults.Notice that the failure of a cable is considered ahardware operational fault, not a topological fault.

    " An operator fault is a mistake by the operator.Any of the three types are possible. A design faultoccurs if the instructions provided to the operatorare incorrect; this is sometimes called a procedurefault. An operational fault would occur if theinstructions are correct, but the operatormisunderstands and doesn't follow them. Atransient fault would occur if the operator isattempting to follow the instructions, but makes anunintended mistake. Hitting the wrong key on akeyboard is an example. (One goal of displayscreen design is to reduce the probability oftransient operator errors.)

• A user fault differs from an operator fault only because of the different type of person involved; operators and users can be expected to have different fault rates.

• An environmental fault is a fault that occurs outside the boundary of the computer system, but that affects the system. Any of the three types is possible. Failure to provide an uninterruptible power supply (UPS) would be a design fault, while failure of the UPS would be an operational fault. A voltage spike on a power line is an example of an environmentally induced transient fault.

• An unknown fault is any fault whose source class is never identified. Unfortunately, in some computer systems many faults occur whose source cannot be identified. All such faults are transient (more or less by definition), and this category may well include a plurality of system faults. Another problem is that the underlying problem may be identified at a later time (possibly months later), so there is a certain impermanence about this category. It generally happens that some information is available about the source of the fault, but not sufficient information to allow the source to be completely identified. For example, it might only be known that there is a fault in a communication system.

Table 2-1 shows which persistence classes may occur for each of the ten fault sources.
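As an illustration (a hypothetical sketch, not part of the report), the table can be encoded directly as a data structure so that fault classifications can be checked for legal source/persistence combinations:

    # Hypothetical encoding of Table 2-1: the persistence classes
    # permitted for each of the ten fault sources.
    ALLOWED_PERSISTENCE = {
        "hardware component": {"design", "operational", "transient"},
        "software component": {"design"},
        "input data":         {"design", "operational"},
        "permanent state":    {"design", "operational"},
        "temporary state":    {"design", "operational"},
        "topological":        {"design"},
        "operator":           {"design", "operational", "transient"},
        "user":               {"design", "operational", "transient"},
        "environmental":      {"design", "operational", "transient"},
        "unknown":            {"transient"},
    }

    def is_valid_classification(source: str, persistence: str) -> bool:
        """True if the (source, persistence) pair is allowed by Table 2-1."""
        return persistence in ALLOWED_PERSISTENCE.get(source, set())

    # All software faults are design faults, so a transient software
    # fault should be rejected as a classification error.
    assert is_valid_classification("software component", "design")
    assert not is_valid_classification("software component", "transient")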

    2.4.2. Failure Classifications

Three aspects of classifying failures are given below; there are others. These are particularly relevant to later discussion in this report.

    2.4.2.1. Failure Modes

Different failure modes can have different effects on a computer system. The following definitions apply (Smith 1972).

• A sudden failure is a failure that could not be anticipated by prior examination. That is, the failure is unexpected.

• A gradual failure is a failure that could be anticipated by prior examination. That is, the system goes into a period of degraded operation before the failure actually occurs.

• A partial failure is a failure resulting in deviations in characteristics beyond specified limits but not such as to cause complete lack of the required function.

• A complete failure is a failure resulting in deviations in characteristics beyond specified limits such as to cause complete lack of the required function. The limits referred to in this category are special limits specified for this purpose.

    " A catastrophic failure is a failure that is bothsudden and complete.

    " A degradation failure is a failure that is bothgradual and partial.

    2.4.2.2. The Scope of Failures

Failures can be assigned to one of three classes, depending on the scope of their effects (Anderson 1983).

    " A failure is internal if it can be adequately handledby the device or process in which the failure isdetected.

    " A failure is limited if it is not internal, but if theeffects are limited to that device or process.

• A failure is pervasive if it results in failures of other devices or processes.


    2.4.2.3. The Effects of Failures on Safety

Finally, it is possible to classify application systems by the effect of failures on safety.

• A system is intrinsically safe if the system has no hazardous states.

• A system is termed fail safe if a hazardous state may be entered, but the system will prevent an accident from resulting from the hazard. An example would be a facility in a reactor that forces a controlled shutdown in case a hazardous state is entered, so that no radiation escapes.

• A system controls accidents if a hazardous state may be entered and an accident may occur, but the system will mitigate the consequences of the accident. An example is the containment shell of a reactor, designed to preclude a radiation release into the environment if an accident did occur.

    " A system gives warning of hazards if a failure mayresult in a hazardous state, but the system issues awarning that allows trained personnel to applyprocedures outside the system to recover from thehazard or mitigate the accident. For example, areactor computer protection system might notifythe operator that a hazardous state has beenentered, permitting the operator to "hit the panicbutton" and force a shutdown in such a way thatthe computer system is not involved.

    " Finally, a system is fail dangerous, or creates anuncontrolled hazard, if system failure can cause anuncontrolled accident.

2.5. Software Qualities

A large number of factors have been identified by various theoreticians and practitioners that affect the quality of software. Many of these are very difficult to quantify. The discussion here is based on IEEE 610.12, Evans 1987, Pressman 1987, and Vincent 1988. The latter two references based their own discussion on McCall 1977. The discussion concentrates on defining those terms that appear important to the design of reactor protection computer systems. Quotations in this section come from the references listed above.

Access Control. The term "access control" relates to "those attributes of the software that provide for control of the access to software and data." In a reactor protection system, this refers to the ability of the utility to prevent unauthorized changes to either software or data within the computer system, incorrect input signals being sent to the computer system by intervention of a human agent, incorrect commands from the operator, and any other forms of tampering. Access control should consider both inadvertent and malicious penetration.

Accuracy. Accuracy refers to "those attributes of the software that provide the required precision in calculations and outputs." In some situations, this can require a careful error analysis of numerical algorithms.
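As a hedged illustration of why such error analysis matters (an example constructed for this discussion, not taken from the report), naive floating-point summation of many small values accumulates rounding error that compensated (Kahan) summation largely removes:

    # Illustrative only: sum one million values of 0.1 (exact answer 100000).
    def kahan_sum(values):
        total = 0.0
        compensation = 0.0  # running compensation for lost low-order bits
        for v in values:
            y = v - compensation
            t = total + y
            compensation = (t - total) - y
            total = t
        return total

    values = [0.1] * 1_000_000
    print(sum(values))        # naive: visibly off from 100000.0
    print(kahan_sum(values))  # compensated: much closer to 100000.0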

Auditability. Auditability refers to the "ease with which conformance to standards can be checked." The careful development of project plans, adherence to those plans, and proper record keeping can help make audits easier, more thorough, and less intrusive. Sections 3 and 4 discuss this topic in great depth.

Completeness. Completeness properties are "those attributes of the software that provide full implementation of the functions required." A software design is complete if all requirements are fulfilled in the design. A software implementation is complete if the code fully implements the design.

Consistency. Consistency is defined as "the degree of uniformity, standardization and freedom from contradictions among the documents or parts of a system or component." Standardized error handling is an example of consistency. Requirements are consistent if they do not require the system to carry out some function, and under the same conditions to carry out its negation. An inconsistent design might cause the system to send incompatible signals to one or more actuators, causing the protection system to attempt contradictory actions. An example would be starting a pump but not opening the intake valve.

Correctness. Correctness refers to the "extent to which a program satisfies its specifications and fulfills the user's mission objectives." This is a broader definition than that given for completeness. It is worth noting that some of the documents referenced at the beginning of the section essentially equate correctness with completeness, while others distinguish between them. IEEE Standard 610.12 gives both forms of definition.

Expandability. Expandability attributes are "those attributes of the software that provide for expansion of data storage requirements or computational functions." The word "extendibility" is sometimes used as a synonym.

Generality. Generality is "the degree to which a system or component performs a broad range of functions." This is not necessarily a desirable attribute of a reactor protection system if the generality encompasses functionality beyond simply protecting the reactor.

Software Instrumentation. Instrumentation refers to "those attributes of the software that provide for measurement of usage or identification of errors." A well-instrumented system can monitor its own operation, and detect errors in that operation. Software instrumentation can be used to monitor the hardware operation as well as its own operation. A hardware device such as a watchdog timer can be used to help monitor the software operation. If instrumentation is required for a computer system, it may have a considerable effect on the system design, so it must be considered as part of that design.
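A minimal sketch of the software side of such an arrangement (hypothetical, not from the report): the monitored loop must reset the watchdog periodically, and a stall allows the timeout action to fire:

    # Hypothetical software watchdog: raises an alarm if not reset in time.
    import threading
    import time

    class Watchdog:
        def __init__(self, timeout_s, on_timeout):
            self.timeout_s = timeout_s
            self.on_timeout = on_timeout
            self._timer = None

        def reset(self):
            # Restart the countdown; the monitored loop calls this regularly.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.timeout_s, self.on_timeout)
            self._timer.daemon = True
            self._timer.start()

        def stop(self):
            if self._timer is not None:
                self._timer.cancel()

    dog = Watchdog(0.5, lambda: print("watchdog timeout: loop stalled"))
    dog.reset()
    for _ in range(3):
        time.sleep(0.1)  # simulated work, well inside the timeout
        dog.reset()
    time.sleep(1.0)      # simulated stall; the timeout action fires
    dog.stop()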

Modularity. Modularity attributes are "those attributes of the software that provide a structure of highly independent modules." To achieve modularity, the protection computer system should be divided into discrete hardware and software components in such a way that a change to one component has minimal impact on the remaining modules. Modularity is measured by cohesion and coupling (Yourdon 1979).

Operability. Operability refers to "those attributes of the software that determine operation and procedures concerned with the operation of the software." This quality is concerned with the man-machine interface, and measures the ease with which the operators can use the system. This is particularly a concern during off-normal and emergency conditions, when confusion may be high and mistakes may be unfortunate.

Robustness. Robustness refers to "the degree to which a system or component can function correctly in the presence of invalid inputs or stressful environmental conditions." This quality is sometimes referred to as "error tolerance" and may be implemented by fault tolerance or design diversity.
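A small sketch of robustness toward invalid inputs (hypothetical names and limits, not from the report): a trip decision that range-checks its sensor reading and fails toward the safe action rather than propagating an error:

    # Hypothetical example: invalid or out-of-range readings are treated
    # as a failed sensor, and the function fails toward the trip (safe) state.
    SENSOR_MIN, SENSOR_MAX = 0.0, 400.0  # assumed plausible sensor range
    TRIP_SETPOINT = 350.0                # assumed trip threshold

    def should_trip(reading) -> bool:
        try:
            value = float(reading)
        except (TypeError, ValueError):
            return True  # invalid input: fail safe
        if not (SENSOR_MIN <= value <= SENSOR_MAX):
            return True  # out-of-range input: fail safe
        return value >= TRIP_SETPOINT

    print(should_trip(340.0))   # False: normal operation
    print(should_trip("n/a"))   # True: invalid input handled safely
    print(should_trip(1e9))     # True: out-of-range handled safely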

Simplicity. Simplicity attributes are "those attributes that provide implementation of functions in the most understandable manner." It can be thought of as the absence of complexity. This is one of the more important design qualities for a reactor computer protection system, and is quite difficult to quantify. See Preckshot 1992 for additional information on complexity and scalability.

A particularly important aspect of complexity is the distinction between functional complexity and structural complexity. The former refers to a system that attempts to carry out many disparate functions, and is controlled by limiting the goals of the system. The latter refers to the method of carrying out the functions, and may be controlled by redesigning the system to carry out the same functions in a simpler way.

Testability. Testability refers to "the degree to which a system or component facilitates the establishment of test criteria and the performance of tests to determine whether those criteria have been met."

Traceability. Traceability attributes are "those attributes of the software that provide a thread from the requirements to the implementation with respect to the specific development and operational environment."
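One common way to realize such a thread (a hypothetical sketch, not from the report) is a traceability matrix linking each requirement to the design elements and tests that cover it, so that uncovered requirements can be detected mechanically:

    # Hypothetical traceability matrix: requirement -> covering artifacts.
    TRACE = {
        "REQ-1": {"design": ["D-1"], "tests": ["T-1", "T-2"]},
        "REQ-2": {"design": ["D-2"], "tests": []},  # no test coverage yet
    }

    uncovered = [req for req, links in TRACE.items() if not links["tests"]]
    print("Requirements without test coverage:", uncovered)  # ['REQ-2']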


3. LIFE CYCLE SOFTWARE RELIABILITY AND SAFETY ACTIVITIES

Much has been written about software engineering and how a well-structured development life cycle can help in the production of correct, maintainable software systems. Many standard software engineering activities should be performed for any software project, so they are not discussed in this report. Instead, the report concentrates on the additional activities required for a software project in which safety is a prime concern. Refer to a general text, such as Macro 1990 or Pressman 1987, for general information on software engineering.

Any software development project can be discussed from a number of different viewpoints. Examples include the customer, the user, the developer, the project manager, the general manager, and the assessor. The viewpoint that is presumed will have a considerable effect on the topics discussed, and particularly on the emphasis placed on different aspects of those topics. The interest here is the viewpoint of the assessor. This is a person (or group of people) who evaluates both the development process and the products of that process for assurance that they meet some externally imposed standard. In this report, those standards will relate to the reliability of the software products and the safety of the application in which the software is embedded. The assessor may be a person in the development organization charged with the duty of assuring reliability and safety, a person in an independent auditing organization, or an employee of a regulatory agency. The difference among these assessors should be the reporting paths, not the technical activities that are carried out. Consequently, no distinction is made here among the different types of assessor.

Since this report is written from the viewpoint of the assessor, the production of documents is emphasized. The documents provide the evidence that required activities have actually taken place. There is some danger that the software developer will concentrate on the creation of the documents rather than the creation of safe, reliable software; the assessor must be constantly on guard against this. The software, not the documents, runs the protection system. There is heavy emphasis below on planning: creating and following the plans that are necessary to the development of software where safety is a particular concern.

The documents that an assessor should expect to have available, and their contents, are the subject of this section of the report. The process of assessing these documents is discussed in Section 4.

    3.1. Planning Activities

Fundamental to the effective management of any engineering project is the planning that goes into the project. This is especially true where extreme reliability and safety are of concern. While there are general issues of avoiding cost and schedule overruns, the particular concern here is safety. Unless a management plan exists, and is followed, the probability is high that some safety concerns will be overlooked at some point in the project lifetime, or lack of time or money near the end of the development period will cause safety concerns to be ignored, or testing will be abridged. It should be noted that the time/money/safety tradeoff is a very difficult management issue requiring very wise judgment. No project manager should be allowed to claim "safety" as an excuse for unconscionable cost or schedule overruns. On the other hand, the project manager should also not be allowed to compromise safety in an effort to meet totally artificial schedule and budget constraints.

For a computer-based safety system, a number of documents will result from the planning activity. These are discussed in this section, insofar as safety is an issue. For example, a software management plan will generally involve non-safety aspects of the development project, which go beyond the discussion in Section 3.1.1.

Software project planning cannot take place in isolation from the rest of the reactor development. It is assumed that a number of documents are available to the software project team. At minimum, the following must exist:

• Hazards analysis. This identifies hazardous reactor system states, sequences of actions that can cause the reactor to enter a hazardous state, sequences of actions intended to return the reactor from a hazardous state to a nonhazardous state, and actions intended to mitigate the consequences of an accident.


    " High level reactor system design. This identifiesthose functions that will be performed by theprotection system, and includes a specification ofthose safety-related actions that will be required ofthe software in order to prevent the reactor fromentering a hazardous state, move the reactor from ahazardous state to a non-hazardous state, ormitigate the consequences of an accident.

    " Interfaces between the protection computersystem and the rest of the reactor protectionsystem. That is, what signals must be obtained

    from sensors and what signals must be provided toactuators by the computer system. Interfaces alsoinclude display devices intended for man-machineinteraction.

Planning a software development project can be a complex process involving a hierarchy of activities. The entire process is beyond the scope of this report. Figure 3-1, taken from Evans 1983 (copyright 1983 by Michael Evans, Pamela Piazza, and James Dolkas; reprinted by permission of John Wiley & Sons, Inc.), gives a hint as to the activities involved. Planning is discussed in detail in Pressman 1987.

[Figure 3-1 (not reproduced): software planning activities, covering software design, production, integration, test, and documentation.]

Figure 3-1. Software Planning Activities


The result of the planning activity will be a set of documents that will be used to oversee the development project. These may be packaged as separate documents, combined into a smaller number of documents, or combined with similar documents used by the larger reactor project. For example, the developer might choose to include the software V&V plan in the software project management plan, or to include the software configuration management plan in a project-wide configuration management plan. Such packaging concerns are beyond the scope of this report. Since some method is necessary in order to discuss documents, the report assumes that separate documents will exist. The documents resulting from planning include the following minimum set; additional documents may be required by the development organization as part of their standard business procedures, or by the assessor due to the nature of the particular project.

• Software Project Management Plan
• Software Quality Assurance Plan
• Software Configuration Management Plan
• Software Verification and Validation Plan
• Software Safety Plan
• Software Development Plan
• Software Integration Plan
• Software Installation Plan
• Software Maintenance Plan
• Software Training Plan
• Software Operations Plan

The actual time at which these documents will be produced depends on the life cycle used by the software developer. The Software Project Management Plan will always need to be done early in the life cycle, since the entire management effort is dependent on it. However, documents such as the Software Operations Plan might be delayed until the software system is ready to install.

3.1.1. Software Project Management Plan

The software project management plan (SPMP) is the basic governing document for the entire development effort. Project oversight, control, reporting, review, and assessment are all carried out within the scope of the SPMP.

One method of organizing the SPMP is to use IEEE Standard 1058; this is done here. Other methods are possible, provided that the topics discussed below are addressed. The plan contents can be roughly divided into several categories: introduction and overview, project organization, managerial processes, technical processes, and budgets and schedules. A sample table of contents, based on IEEE 1058, is shown in Figure 3-2. Those aspects of the plan that directly affect safety are discussed next.

1. Introduction
   1.1. Project Overview
   1.2. Project Deliverables
   1.3. Evolution of the SPMP
   1.4. Reference Materials
   1.5. Definitions and Acronyms
2. Project Organization
   2.1. Process Model
   2.2. Organizational Structure
   2.3. Organizational Boundaries and Interfaces
   2.4. Project Responsibilities
3. Managerial Process
   3.1. Management Objectives and Priorities
   3.2. Assumptions, Dependencies and Constraints
   3.3. Risk Management
   3.4. Monitoring and Controlling Mechanisms
   3.5. Staffing Plan
4. Technical Process
   4.1. Methods, Tools and Techniques
   4.2. Software Documentation
   4.3. Project Support Functions
5. Work Packages, Schedule and Budget
   5.1. Work Packages
   5.2. Dependencies
   5.3. Resource Requirements
   5.4. Budget and Resource Allocation
   5.5. Schedule
6. Additional Components
Index
Appendices

Figure 3-2. Outline of a Software Project Management Plan


A combination of text and graphics may be used to create and document the SPMP. PERT charts, organization charts, matrix diagrams, or other formats are frequently useful.

    3.1.1.1. Project Organization

This portion of the SPMP addresses organizational issues; specifically, the process model, organizational structure, boundaries and interfaces, and project responsibilities. The following items should be discussed in this portion of the plan.

• Process Model. Define the relationships among major project functions and activities. The following specifications must be provided:

    - Timing of major milestones.

    - Project baselines.

    - Timing of project reviews and audits.

    - Work products of the project.

    - Project deliverables.

• Organization Structure. Describe the internal management structure of the project.

    - Lines of authority.

- Responsibility for the various aspects of the project.

    - Lines of communication within the project.

- The means by which the SPMP will be updated if the project organization changes. Note that the SPMP should be under configuration control; see Section 3.1.3.

• Organization Boundaries. Describe the administrative and managerial boundaries, and interfaces across those boundaries, between the project and the following external entities.

    - The parent organization.

    - The customer organization.

    - Any subcontractor organizations.

    - The regulatory and auditor organizations.

- Support organizations, including quality assurance, verification and validation, and configuration management.

    " Project responsibilities. State the'nature of each* major project function and activity, and identify by

    name the individuals who are responsible for

    them. Give the method by which these names canbe changed during the life of the project.

    3.1.1.2. Project Management Procedures

This section of the SPMP will describe the management procedures that will be followed during the project development life cycle. Topics that can affect safety are listed here; the development organization will normally include additional information in order to completely describe the management procedures. The following aspects of the SPMP fall into the categ