Reliability

Reliability engineering

From Wikipedia, the free encyclopedia

Reliability engineeringisengineeringthat emphasizesdependabilityin thelifecycle managementof aproduct. Dependability, or reliability, describes the ability of a system or component to function under stated conditions for a specified period oftime.[1]Reliability engineering represents a sub-discipline withinsystems engineering. Reliability istheoreticallydefined as theprobabilityof success (Reliability=1-Probability of Failure), as the frequency of failures, or in terms ofavailability, as a probability derived from reliability and maintainability.Maintainabilityandmaintenanceis often defined as a part of "reliability engineering" in Reliability Programs. Reliability plays a key role in thecost-effectivenessof systems.

Reliability engineering deals with the estimation and management of high levels of "lifetime" engineeringuncertaintyandrisksof failure. Althoughstochasticparameters define and affect reliability, according to some expert authors on Reliability Engineering, e.g. P. O'Conner, J. Moubray[2]and A. Barnard,[3]reliability is not (solely) achieved by mathematics and statistics. "Nearly all teaching and literature on the subject emphasize these aspects, and ignore the reality that the ranges of uncertainty involved largely invalidate quantitative methods for prediction and measurement"[4]Reliability engineering relates closely tosafety engineeringand tosystem safety, in that they use common methods for their analysis and may require input from each other. Reliability engineering focuses on costs of failure caused by system downtime, cost of spares, repair equipment, personnel and cost of warranty claims. Safety engineering normally emphasizes not cost, but preserving life and nature, and therefore deals only with particular dangerous system-failure modes. High reliability (safety factor) levels also result from good engineering, from attention to detail and almost never from only re-active failure management (reliability accounting / statistics).[5]A former United States Secretary of Defense, economistJames R. Schlesinger, once stated: "Reliability is, after all, engineering in its most practical form."[4]Contents

[hide]History[6][edit]The word reliability can be traced back to 1816, by poet Samuel Coleridge.[7]Before World War II the name has been linked mostly to repeatability. A test (in any type of science) was considered reliable if the same results would be obtained repeatedly. In the 1920s product improvement through the use of statistical quality control was promoted by Dr. Walter A. Shewart at Bell Labs.[8]Around this time Wallodi Weibull was working on statistical models for fatigue. The development of reliability engineering was here on a parallel path with quality. The modern use of the word reliability was defined by the U.S. military in the 1940s and evolved to the present. It initially came to mean that a product would operate when expected (nowadays called mission readiness)andfor a specified period of time. A notable figure who played very important roles in engineering, specially with respect to reliability engineering, was the German rocket scientistWernher von Braun according NASA, without any doubt, the greatest rocket scientist in history. In his early life he worked on the German V1 missiles, which were plagued with reliability issues that could not be solved at that time by component improvements. One of the first ideas to implement redundancy in systems were adapted at this time. Wernher von Braun was also during his long career at NASA famous to his very conservative engineering approach using ample safety factors and redundancy. Some say that this has resulted in the lost U.S. race for the first man into space, but to the success of the first man on the Moon. He was the chief architect of the Saturn V launch vehicle, the superbooster that propelled theApollo spacecraftto the Moon.

In the time around the WWII and later, many reliability issues were due to inherent reliability of electronics and fatigue issues. In 1945 M.A. Miner published the seminal paper titled Cumulative Damage in fatigue in an ASME journal. A main application for reliability engineering in the military was for the vacuum tube as used in radar systems and other electronics, which reliability has proved to be very problematic and costly. The IEEE formed the Reliability Society in 1948. In 1950, on the military side, a group called the Advisory Group on the Reliability of Electronic Equipment, AGREE, was born. This group recommended the following 3 main ways of working:

1. Improve Component Reliability

2. Establish quality and reliability requirements (also) for suppliers

3. Collect field data and find root causes of failures

In the 1960s more emphasis was given to reliability testing on componentandsystem level. The famous military standard 781 was created at that time. Around this period also the much used and also debated military handbook 217 was published by the RCA (Radio Corporation of America) used for the prediction of failure rates of components. The emphasis on component reliability and empirical research (e.g. mil std 217) alone slowly decreases. More pragmatic approaches as used in the consumer industries are being used. The 1980s was a decade of great changes. Televisions had become all semiconductor. Automobiles rapidly increased their use of semiconductors with a variety of microcomputers under the hood and in the dash. Large air conditioning systems developed electronic controllers, as had microwave ovens and a variety of other appliances. Communications systems began to adopt electronics to replace older mechanical switching systems. Bellcore issued the first consumer prediction methodology for telecommunications and SAE developed a similar document SAE870050 for automotive applications. The nature of predictions evolved during the decade and it became apparent that die complexity wasn't the only factor that determined failure rates for ICs. Kam Wong published a paper questioning the bathtub curve[9]- see alsoReliability Centered Maintenance. During this decade, the failure rate of many components dropped by a factor of 10. Software became important to the reliability of systems. By the 1990s, the pace of IC development was picking up. Wider use of stand alone microcomputers was common and the PC market helped keep IC densities following Moores Law and doubling about every 18 months. Reliability Engineering now was more changing towards understanding thephysics of failure. Failure rates for components kept in dropping, but system levels issues become more prominent.Systems Thinkingbecomes now more and more important. For software the CCM model (Capability Maturity Model) was developed, which gave a more qualitative approach to reliability. ISO 9000 added reliability measures as part of the design and development portion of Certification. The expansion of the World Wide Web created new challenges of security and trust. The older problem of too little reliability information available had now been replaced by too much information of questionable value. Consumer reliability problems could now have data and be discussed online in real time. New technologies such as micro-electro mechanical systems (MEMS), handheld GPS, and hand-held devices that combined cell phones and computers all represent challenges to maintain reliability. Product development time continued to shorten through this decade and what had been done in three years was now done in 18 months. This meant reliability tools and tasks must be more closely tied to the development process itself. In many ways, reliability became part of everyday life and consumer expectations.

Overview[edit]Objective[edit]The objectives of reliability engineering, in the order of priority, are:[10]1. To apply engineering knowledge and specialist techniques to prevent or to reduce the likelihood or frequency of failures.

2. To identify and correct the causes of failures that do occur, despite the efforts to prevent them.

3. To determine ways of coping with failures that do occur, if their causes have not been corrected.

4. To apply methods for estimating the likely reliability of new designs, and for analysing reliability data.

The reason for the priority emphasis is that it is by far the most effective way of working, in terms of minimizing costs and generating reliable products.The primary skills that are required, therefore, are the ability to understand and anticipate the possible causes of failures, and knowledge of how to prevent them. It is also necessary to have knowledge of the methods that can be used for analysing designs and data.

Scope and techniques to be used within Reliability Engineering[edit]Reliability engineering forcomplex systemsrequires a different, more elaborate systems approach than for non-complex systems. Reliability engineering may in that case involve:

System availability and mission readiness analysis and related reliability and maintenance requirement allocation

Functional System Failure analysis and derived requirements specification

Inherent (system) Design Reliability Analysis and derived requirements specification: for both Hardware- and Software design

System Diagnostics design

Predictive and Preventive maintenance (e.g. Reliability Centered Maintenance)

Human Factors / Human Interaction / Human Errors

Manufacturing and Assembly induced failures (non 0-hour Quality)

Maintenance induced failures

Transport induced failures

Storage induced failures

Use (load) studies, component stress analysis and derived requirements specification

Software(systematic) failures

Failure / reliability testing

Field failure monitoring and corrective actions

Spare-parts stocking (Availability control)

Technical documentation, caution and warning analysis

Data and information acquisition/organisation (Creation of a general reliability development Hazard Log and FRACAS system)

Effective reliability engineering requires understanding of the basics offailure mechanismsfor which experience, broad engineering skills and good knowledge from many different special fields of engineering,[11]like:

Tribology Stress (mechanics) Fracture mechanics/Fatigue (material) Thermal engineering

Fluid mechanics / shock loading engineering

Electrical engineering

Chemical engineering (e.g. Corrosion)

Material science

...

Definitions[edit]Reliabilitymay be defined in the following ways:

The idea that an item is fit for a purpose with respect to time

The capacity of a designed, produced or maintained item to perform as required over time

The capacity of a population of designed, produced or maintained items to perform as required over specified time

The resistance to failure of an item over time

The probability of an item to perform a required function under stated conditions for a specified period oftime The durability of an object.

Basics of a Reliability Assessment[edit]Many engineering techniques are used in reliabilityrisk assessments, such as reliabilityhazard analysis,failure mode and effects analysis(FMEA),[12]fault tree analysis(FTA),Reliability Centered Maintenance, material stress and wear calculations, fatigue and creep analysis, human error analysis, reliability testing, etc. Because of the large number of reliability techniques, their expense, and the varying degrees of reliability required for different situations, most projects develop a reliability program plan to specify the reliability tasks that will be performed for that specific system.

Consistent with the creation of asafety cases, for exampleARP4761, the goal of reliability assessments is to provide a robust set of qualitative and quantitative evidence that use of a component or system will not be associated with unacceptable risk. The basic steps to take[13]are to:

First thoroughlyidentifyrelevant unreliability "hazards", e.g. potential conditions, events, human errors, failure modes, interactions, failure mechanisms and root causes, by specific analysis or tests

Assess the associated system risk, by specific analysis or testing

Propose mitigation, e.g.requirements, design changes, detection logic,maintenance, training, by which the risks may be lowered and controlled for at an acceptable level.

Determine the best mitigation and get agreement on final, acceptable risk levels, possibly based on cost-benefit analysis

Riskis here the combination of probabilityandseverity of the failure incident (scenario) occurring.

In a deminimus definition, severity of failures include the cost of spare parts, man hours, logistics, damage (secondary failures) and downtime of machines which may cause production loss. A more complete definition of failure also can mean injury, dismemberment and death of people within the system (witness mine accidents, industrial accidents, space shuttle failures) and the same to innocent bystanders (witness the citizenry of cities like Bhopal, Love Canal, Chernobyl or Sendai and other victims of the 2011 Thoku earthquake and tsunami) - in this case, Reliability Engineering becomes System Safety. What is acceptable is determined by the managing authority or customers or the effected communities. Residual risk is the risk that is left over after all reliability activities have finished and includes theun-identifiedrisk and is therefore not completely quantifiable.

Reliability and availability program plan[edit]A reliability program plan is used to document exactly what "best practices" (tasks, methods, tools, analysis and tests) are required for a particular (sub)system, as well as clarify customer requirements for reliability assessment. For large scale, complex systems, the reliability program plan should be a separatedocument. Resource determination for manpower and budgets for testing and other tasks is critical for a successful program. In general, the amount of work required for an effective program for complex systems is large.

A reliability program plan is essential for achieving high levels of reliability, testability,maintainabilityand the resulting systemAvailabilityand is developed early during system development and refined over the systems life-cycle. It specifies not only what the reliability engineer does, but also the tasks performed by otherstakeholders. A reliability program plan is approved by top program management, which is responsible for allocation of sufficient resources for its implementation.

A reliability program plan may also be used to evaluate and improve availability of a system by the strategy on focusing on increasing testability & maintainability and not on reliability. Improving maintainability is generally easier than reliability. Maintainability estimates (Repair rates) are also generally more accurate. However, because the uncertainties in the reliability estimates are in most cases very large, it is likely to dominate the availability (prediction uncertainty) problem; even in the case maintainability levels are very high. When reliability is not under control more complicated issues may arise, like manpower (maintainers / customer service capability) shortage, spare part availability, logistic delays, lack of repair facilities, extensive retro-fit and complex configuration management costs and others. The problem of unreliability may be increased also due to the "domino effect" of maintenance induced failures after repairs. Only focusing on maintainability is therefore not enough. If failures are prevented, none of the others are of any importance and therefore reliability is generally regarded as the most important part of availability. Reliability needs to be evaluated and improved related to both availabilityandthe cost of ownership (due to cost of spare parts, maintenance man-hours, transport costs, storage cost, part obsolete risks, etc.). But, as GM and Toyota have belatedly discovered, TCO also includes the down-stream liability costs when reliability calculations do not sufficiently or accurately address customers' personal bodily risks. Often a trade-off is needed between the two. There might be a maximum ratio between availability and cost of ownership. Testability of a system should also be addressed in the plan as this is the link between reliability and maintainability. The maintenance strategy can influence the reliability of a system (e.g. by preventive and/or predictive maintenance), although it can never bring it above the inherent reliability.

The reliability plan should clearly provide a strategy for availability control. Whether only availability or also cost of ownership is more important depends on the use of the system. For example, a system that is a critical link in a production system e.g. a big oil platform is normally allowed to have a very high cost of ownership if this translates to even a minor increase in availability, as the unavailability of the platform results in a massive loss of revenue which can easily exceed the high cost of ownership. A proper reliability plan should always address RAMT analysis in its total context. RAMT stands in this case for reliability, availability, maintainability/maintenance and testability in context to the customer needs.

Reliability requirements[edit]For any system, one of the first tasks of reliability engineering is to adequately specify the reliability and maintainability requirementsderivedfrom the overallavailabilityneeds and more importantly, from properdesign failure analysisor preliminaryprototype test results. Clear (able to design to) Requirements should constrain the designers from designing particular unreliable items / constructions / interfaces / systems. Settingonlyavailability (reliability, testability and maintainability) allocated targets (e.g. max. Failure rates) is not appropriate. This is a broad misunderstanding about Reliability Requirements Engineering. Reliability requirements address the system itself, including test and assessment requirements, and associated tasks and documentation. Reliability requirements are included in the appropriate system or subsystem requirements specifications, test plans and contract statements. Creation of proper lower level requirements is critical.[14]Provision of only quantitative minimum targets (e.g. MTBF values/ Failure rates) is not sufficient for different reasons. One reason is that afull validation(related to correctness and verifiability in time) of an quantitative reliabilityallocation(requirement spec) on lower levels for complex systems can (often) not be made as a consequence of 1) The fact that the requirements are probabalistic 2) The extremely high level of uncertainties involved for showing compliance with all these probabalistic requirements 3) Reliability is a function of time and accurate estimates of a (probabalistic) reliability number per item are available only very late in the project, sometimes even only many years after in-service use. Compare this problem with the continues (re-)balancing of for example lower level system mass requirements in the development of an aircraft, which is already often a big undertaking. Notice that in this case masses do only differ in terms of only some%, are not a function of time, the data is non-probabalistic and available already in CAD models. In case of reliability, the levels of unreliability (failure rates) may change with factors of decades (1000's of%) as result of very minor deviations in design, process or anything else.[15]The information is often not available without huge uncertainties within the development phase. This makes this allocation problem almost impossible to do in a useful, practical, valid manner, which does not result in massive over- or under specification. A pragmatic approach is therefore needed. For example; the use of general levels / classes of quantitative requirements only depending on severity of failure effects. Also the validation of results is a far more subjective task than for any other type of requirement. (Quantitative) Reliability parameters -in terms of MTBF - are by far the most uncertain design parameters in any design.

Furthermore, reliability design requirements should drive a (system or part) design to incorporate features that prevent failures from occurring or limit consequences from failure in the first place! Not only to make some predictions, this could potentially distract the engineering effort to a kind of accounting work. A design requirement should be so precise enough so that a designer can "design to" it and can also prove -through analysis or testing- that the requirement has been achieved, and if possible within some a stated confidence. Any type of reliability requirement should be detailed and could be derived from failure analysis (Finite Element Stress and Fatigue analysis, Reliability Hazard Analysis, FTA, FMEA, Human Factor analysis, Functional Hazard Analysis, etc.) or any type of reliability testing. Also, requirements are needed for verification tests e.g. required overload loads (or stresses) and test time needed. To derive these requirements in an effective manner, asystems engineeringbased risk assessment and mitigation logic should be used. Robust hazard log systems are to be created that contain detailed information on why and how systems could or have failed. Requirements are to be derived and tracked in this way. These practical design requirements shall drive the design and not only be used for verification purposes. These requirements (often design constraints) are in this wayderivedfrom failure analysis or preliminary tests. Understanding of this difference with only pure quantitative (logistic) requirement specification (e.g. Failure Rate / MTBF setting) is paramount in the development of successful (complex) systems.[16]The maintainability requirements address the costs of repairs as well as repair time. Testability (not to be confused with test requirements) requirements provide the link between reliability and maintainability and should address detectability of failure modes (on a particular system level), isolation levels and the creation of diagnostics (procedures).

As indicated above, reliability engineers should also address requirements for various reliability tasks and documentation during system development, test, production, and operation. These requirements are generally specified in the contract statement of work and depend on how much leeway the customer wishes to provide to the contractor. Reliability tasks include various analyses, planning, and failure reporting. Task selection depends on the criticality of the system as well as cost. A safety critical system may require a formal failure reporting and review process throughout development, whereas a non-critical system may rely on final test reports. The most common reliability program tasks are documented in reliability program standards, such as MIL-STD-785 and IEEE 1332. Failure reporting analysis and corrective action systems are a common approach for product/process reliability monitoring.

Reliability culture / Human Errors / Human Factors[edit]Practically, most failures can in the end be traced back to a root causes of the type ofhuman errorof any kind. For example, human errors in:

Management decisions on for example budgeting, timing and required tasks

Systems Engineering: Use studies (load cases)

Systems Engineering: Requirement analysis / setting

Systems Engineering: Configuration control

Assumptions

Calculations / simulations / FEM analysis

Design

Design drawings

Testing (incorrect load settings or failure measurement)

Statistical analysis

Manufacturing

Quality control

Maintenance

Maintenance manuals

Training

Classifying and Ordering of information

feedback of field information (incorrect or vague)

etc.

However, humans are also very good in detection of (the same) failures, correction of failures and improvising when abnormal situations occur. The policy that human actions should be completely ruled out of any design and production process to improve reliability may not be effective therefore. Some tasks are better performed by humans and some are better performed by machines.[17]Furthermore, human errors in management and the organization of data and information or the misuse or abuse of items may also contribute to unreliability. This is the core reason why high levels of reliability for complex systems can only be achieved by following a robustsystems engineeringprocess with proper planning and execution of the validation and verification tasks. This also includes careful organization of data and information sharing and creating a "reliability culture" in the same sense as having a "safety culture" is paramount in the development of safety critical systems.

Reliability prediction and improvement[edit]Reliability prediction is the combination of the creation of a proper reliability model (see further on this page) together with estimating (and justifying) the input parameters for this model (like failure rates for a particular failure mode or event and the mean time to repair the system for a particular failure) and finally to provide a system (or part) level estimate for the output reliability parameters (system availability or a particular functional failure frequency).

Some recognized reliability engineering specialists e.g. Patrick O'Connor, R. Barnard have argued that too much emphasis is often given to the prediction of reliability parameters and more effort should be devoted to the prevention of failure (reliability improvement).[4]Failures can and should be prevented in the first place for most cases. The emphasis on quantification and target setting in terms of (e.g.) MTBF might provide the idea that there is a limit to the amount of reliability that can be achieved. In theory there isnoinherent limit and higher reliability doesnotneed to be more costly in development. Another of their arguments is that prediction of reliability based on historic data can be very misleading, as a comparison is only valid for exactly the same designs, products, manufacturing processes and maintenance under exactly the same loads and environmental context. Even a minor change in detail in any of these could have major effects on reliability. Furthermore, normally the most unreliable and important items (most interesting candidates for a reliability investigation) are most often subjected to many modifications and changes. Engineering designs are in most industries updated frequently. This is the reason why the standard (re-active or pro-active) statistical methods and processes as used in the medical industry or insurance branch are not as effective for engineering. Another surprising but logical argument is that to be able to accurately predict reliability by testing, the exact mechanisms of failure must have been known in most cases and therefore in most cases can be prevented! Following the incorrect route by trying to quantify and solving a complex reliabilityengineeringproblem in terms of MTBF or Probability and using the re-active approach is referred to by Barnard as "Playing the Numbers Game" and is regarded as bad practise.[5]For existing systems, it is arguable that responsible programs would directly analyse and try to correct the root cause of discovered failures and thereby may render the initial MTBF estimate fully invalid as new assumptions (subject to high error levels) of the effect of the patch/redesign must be made. Another practical issue concerns a general lack of availability of detailed failure data and not consistent filtering of failure (feedback) data or ignoring statistical errors, which are very high for rare events (like reliability related failures). Very clear guidelines must be present to be able to count and compare failures, related to different type of root-causes (e.g. manufacturing-, maintenance-, transport-, system-induced or inherent design failures, ). Comparing different type of causes may lead to incorrect estimations and incorrect business decisions about the focus of improvement.

To perform a proper quantitative reliability prediction for systems may be difficult and may be very expensive if done by testing. On part level, results can be obtained often with higher confidence as many samples might be used for the available testing financial budget, however unfortunately these tests might lack validity on system level due to the assumptions that had to be made for part level testing. These authors argue that it can not be emphasized enough that testing for reliability should be done to create failures in the first place, learn from them andto improvethe system / part. The general conclusion is drawn that anaccurateand an absolute prediction by field data comparison or testing of reliability is in most cases not possible. An exception might be failures due to wear-out problems like fatigue failures. In the introduction of MIL-STD-785 it is written that reliability prediction should be used with great caution if not only used forcomparisonin trade-off studies.

See also:Risk Assessment#Quantitative risk assessment Critics paragraph

Design for reliability[edit]Reliability design begins with the development of a (system)model. Reliability and availability models useblock diagramsandFault Tree Analysisto provide a graphical means of evaluating the relationships between different parts of the system. These modelsmayincorporate predictions based on failure rates taken from historical data. While the (input data) predictions are often not accurate in an absolute sense, they are valuable to assess relative differences in design alternatives. Maintainability parameters, for exampleMTTR, are other inputs for these models.

The most important fundamental initiating causes and failure mechanisms are to be identified and analyzed with engineering tools. A diverse set of practical guidance and practical performance and reliability requirements should be provided to designers so they can generate low-stressed designs and products that protect or are protected against damage and excessive wear. Proper Validation of input loads (requirements) may be needed and verification for reliability "performance" by testing may be needed.

A Fault Tree Diagram

One of the most important design techniques isredundancy. This means that if one part of the system fails, there is an alternate success path, such as a backup system. The reason why this is the ultimate design choice is related to the fact that high confidence reliability evidence for new parts / items is often not available or extremely expensive to obtain. By creating redundancy, together with a high level of failure monitoring and the avoidance of common cause failures, even a system with relative bad single channel (part) reliability, can be made highly reliable (mission reliability) on system level. No testing of reliability has to be required for this. Furthermore, by using redundancy and the use of dissimilar design and manufacturing processes (different suppliers) for the single independent channels, less sensitivity for quality issues (early childhood failures) is created and very high levels of reliability can be achieved at all moments of the development cycles (early life times and long term). Redundancy can also be applied in systems engineering by double checking requirements, data, designs, calculations, software and tests to overcome systematic failures.

Another design technique to prevent failures is calledphysics of failure. This technique relies on understanding the physical static and dynamic failure mechanisms. It accounts for variation in load, strength and stress leading to failure at high level of detail, possible with use of modernfinite element method(FEM) software programs that may handle complex geometries and mechanisms like creep, stress relaxation, fatigue and probabilistic design (Monte Carlo simulations / DOE). The material or component can be re-designed to reduce the probability of failure and to make it more robust against variation. Another common design technique iscomponentderating: Selecting components whose tolerance significantly exceeds the expected stress, as using a heavier gauge wire that exceeds the normal specification for the expectedelectrical current.

Another effective way to deal with unreliability issues is to perform analysis to be able to predict degradation and being able to prevent unscheduled down events / failures from occurring.RCM(Reliability Centered Maintenance) programs can be used for this.

Many tasks, techniques and analyses are specific to particular industries and applications. Commonly these include:

Built-in test (BIT) (testability analysis)Failure mode and effects analysis(FMEA)Reliabilityhazard analysis Reliability block-diagram analysis

Dynamic Reliability block-diagram analysis[18] Fault tree analysis Root cause analysis Statistical Engineering,Design of Experiments- e.g. on Simulations / FEM models or with testing

Sneak circuit analysis

Accelerated testing

Reliability growth analysis (re-active reliability)

Weibullanalysis (for testing or mainly "re-active" reliability)

Thermal analysisby finite element analysis (FEA) and / or measurement

Thermal induced, shock andvibration fatigueanalysis by FEA and / or measurement

Electromagnetic analysis

Avoidance ofsingle point of failure Functional analysis and functional failure analysis (e.g., function FMEA, FHA or FFA)

Predictive and preventive maintenance: reliability centered maintenance (RCM) analysis

Testability analysis

Failure diagnostics analysis (normally also incorporated in FMEA)

Human error analysis

Operational hazard analysis

Manual screening

Integrated logistics supportResults are presented during the system design reviews and logistics reviews. Reliability is just one requirement among many system requirements. Engineering trade studies are used to determine theoptimumbalance between reliability and other requirements and constraints.

Quantitative and Qualitative approaches and the importance of language in reliability engineering[edit]Reliability engineers could concentrate more on "why and how" items / systems may fail or have failed, instead of mostly trying to predict "when" or at what (changing) rate (failure rate (t)). Answers to the first questions will drive improvement in design and processes.[4]When failure mechanisms are really understood than solutions to prevent failure are easily found. Only required Numbers (e.g. MTBF) will not drive good designs. The huge amount of (un)reliability hazards that are generally part of complex systems need first to be classified and ordered (based on qualitative and quantitative logic if possible) to get to efficient assessment and improvement. This is partly done in pure language andpropositionlogic, but also based on experience with similar items. This can for example be seen in descriptions of events inFault Tree Analysis,FMEAanalysis and a hazard (tracking) log. In this sense language and proper grammar (part of qualitative analysis) plays an important role in reliability engineering, just like it does insafety engineeringor in general withinsystems engineering. Engineers are likely to question why? Well, it is precisely needed because systems engineering is very much about finding the correct words to describe the problem (and related risks) to be solved by the engineering solutions we intend to create. In the words of Jack Ring, the systems engineers job is to language the project. [Ring et al. 2000].[19]Language in itself is about putting an order in a description of the reality of a (failure of a) complex function/item/system in a complex surrounding. Reliability engineers use both quantitativeandqualitative methods, which extensively use language to pinpoint the risks to be solved.

The importance of language also relates to the risks ofhuman error, which can be seen as the ultimate root cause of almost all failures - see further on this site. As an example, proper instructions (often written by technical authors in so calledsimplified English) in maintenance manuals, operation manuals, emergency procedures and others are needed to prevent systematic human errors in any maintenance or operational task that may result in system failures.

Reliability modelling[edit]Reliability modelling is the process of predicting or understanding the reliability of a component or system prior to its implementation. Two types of analysis that are often used to model a complete system availability (including effects from logistics issues like spare part provisioning, transport and manpower) behavior areFault Tree Analysisand reliability block diagrams. On component level the same type of analysis can be used together with others. The input for the models can come from many sources: Testing, Earlier operational experience field data or data handbooks from the same or mixed industries can be used. In all cases, the data must be used with great caution as predictions are only valid in case the same product in the same context is used. Often predictions are only made to compare alternatives.

A reliability block diagram showing a 1oo3 (1 out of 3) redundant designed subsystemFor part level predictions, two separate fields of investigation are common:

Thephysics of failureapproach uses an understanding of physical failure mechanisms involved, such as mechanicalcrack propagationor chemicalcorrosiondegradation or failure;

Theparts stress modellingapproach is an empirical method for prediction based on counting the number and type of components of the system, and the stress they undergo during operation.

Software reliabilityis a more challenging area that must be considered when it is a considerable component to system functionality.

Reliability theory[edit]Main articles:Reliability theory,Failure rateandSurvival analysisReliability is defined as theprobabilitythat a device will perform its intended function during a specified period of time under stated conditions. Mathematically, this may be expressed as,

,whereis the failureprobability density functionandis the length of the period of time (which is assumed to start from time zero).

There are a few key elements of this definition:

Reliability is predicated on "intended function:" Generally, this is taken to mean operation without failure. However, even if no individual part of the system fails, but the system as a whole does not do what was intended, then it is still charged against the system reliability. The system requirements specification is the criterion against which reliability is measured.

1. Reliability applies to a specified period of time. In practical terms, this means that a system has a specified chance that it will operate without failure before time. Reliability engineering ensures that components and materials will meet the requirements during the specified time. Units other than time may sometimes be used.

2. Reliability is restricted to operation under stated (or explicitly defined) conditions. This constraint is necessary because it is impossible to design a system for unlimited conditions. AMars Roverwill have different specified conditions than a family car. The operating environment must be addressed during design and testing. That same rover may be required to operate in varying conditions requiring additional scrutiny.

Quantitative system reliability parameters theory[edit]Quantitative Requirements are specified using reliabilityparameters. The most common reliability parameter is themean time to failure(MTTF), which can also be specified as thefailure rate(this is expressed as a frequency or conditional probability density function (PDF)) or the number of failures during a given period. These parameters may be useful for higher system levels and systems that are operated frequently, such as most vehicles, machinery, and electronic equipment. Reliability increases as the MTTF increases. The MTTF is usually specified in hours, but can also be used with other units of measurement, such as miles or cycles. Using MTTF values on lower system levels can be very misleading, specially if the Failures Modes and Mechanisms it concerns (The F in MTTF) are not specified with it.[15]In other cases, reliability is specified as the probability of mission success. For example, reliability of a scheduled aircraft flight can be specified as a dimensionless probability or a percentage, as insystem safetyengineering.

A special case of mission success is the single-shot device or system. These are devices or systems that remain relatively dormant and only operate once. Examples include automobileairbags, thermalbatteriesandmissiles. Single-shot reliability is specified as a probability of one-time success, or is subsumed into a related parameter. Single-shot missile reliability may be specified as a requirement for the probability of a hit. For such systems,the probability of failure on demand (PFD)is the reliability measure which actually is an unavailability number. This PFD is derived from failure rate (a frequency of occurrence) and mission time for non-repairable systems.

For repairable systems, it is obtained from failure rate and mean-time-to-repair (MTTR) and test interval. This measure may not be unique for a given system as this measure depends on the kind of demand. In addition to system level requirements, reliability requirements may be specified for critical subsystems. In most cases, reliability parameters are specified with appropriate statisticalconfidence intervals.

Reliability testing[edit]

A reliability sequential test plan

The purpose of reliability testing is to discover potential problems with the design as early as possible and, ultimately, provide confidence that the system meets its reliability requirements.

Reliability testing may be performed at several levels and there are different types of testing. Complex systems may be tested at component, circuit board, unit, assembly, subsystem and system levels[20][1]. (The test level nomenclature varies among applications.) For example, performing environmental stress screening tests at lower levels, such as piece parts or small assemblies, catches problems before they cause failures at higher levels. Testing proceeds during each level of integration through full-up system testing, developmental testing, and operational testing, thereby reducing program risk. However, testing does not mitigate unreliability risk.

With each test both a statistical type 1 and type 2 error could be made and depends on sample size, test time, assumptions and the needed discrimination ratio. There is risk of incorrectly accepting a bad design (type 1 error) and the risk of incorrectly rejecting a good design (type 2 error).

It is not always feasible to test all system requirements. Some systems are prohibitively expensive to test; somefailure modesmay take years to observe; some complex interactions result in a huge number of possible test cases; and some tests require the use of limited test ranges or other resources. In such cases, different approaches to testing can be used, such as (highly) accelerated life testing,design of experiments, andsimulations.

The desired level of statistical confidence also plays an role in reliability testing. Statistical confidence is increased by increasing either the test time or the number of items tested. Reliability test plans are designed to achieve the specified reliability at the specifiedconfidence levelwith the minimum number of test units and test time. Different test plans result in different levels of risk to the producer and consumer. The desired reliability, statistical confidence, and risk levels for each side influence the ultimate test plan. The customer and developer should agree in advance on how reliability requirements will be tested.

A key aspect of reliability testing is to define "failure". Although this may seem obvious, there are many situations where it is not clear whether a failure is really the fault of the system. Variations in test conditions, operator differences, weather and unexpected situations create differences between the customer and the system developer. One strategy to address this issue is to use ascoring conferenceprocess. A scoring conference includes representatives from the customer, the developer, the test organization, the reliability organization, and sometimes independent observers. The scoring conference process is defined in the statement of work. Each test case is considered by the group and "scored" as a success or failure. This scoring is the official result used by the reliability engineer.

As part of the requirements phase, the reliability engineer develops a test strategy with the customer. The test strategy makes trade-offs between the needs of the reliability organization, which wants as much data as possible, and constraints such as cost, schedule and available resources. Test plans and procedures are developed for each reliability test, and results are documented.

Reliability test requirements[edit]Reliability test requirements can follow from any analysis for which the first estimate of failure probability, failure mode or effect needs to be justified. Evidence can be generated with some level of confidence by testing. With software-based systems, the probability is a mix of software and hardware-based failures. Testing reliability requirements is problematic for several reasons. A single test is in most cases insufficient to generate enough statistical data. Multiple tests or long-duration tests are usually very expensive. Some tests are simply impractical, and environmental conditions can be hard to predict over a systems life-cycle.

Reliability engineering is used to design a realistic and affordable test program that provides empirical evidence that the system meets its reliability requirements. Statisticalconfidence levelsare used to address some of these concerns. A certain parameter is expressed along with a corresponding confidence level: for example, anMTBFof 1000 hours at 90% confidence level. From this specification, the reliability engineer can, for example, design a test with explicit criteria for the number of hours and number of failures until the requirement is met or failed. Different sorts of tests are possible.

The combination of required reliability level and required confidence level greatly affects the development cost and the risk to both the customer and producer. Care is needed to select the best combination of requirements e.g. cost-effectiveness. Reliability testing may be performed at various levels, such as component,subsystemandsystem. Also, many factors must be addressed during testing and operation, such as extreme temperature and humidity, shock, vibration, or other environmental factors (like loss of signal, cooling or power; or other catastrophes such as fire, floods, excessive heat, physical or security violations or other myriad forms of damage or degradation). For systems that must last many years, accelerated life tests may be needed.

Accelerated testing[edit]The purpose ofaccelerated life testing (ALT test)is to induce field failure in the laboratory at a much faster rate by providing a harsher, but nonetheless representative, environment. In such a test, the product is expected to fail in the lab just as it would have failed in the fieldbut in much less time. The main objective of an accelerated test is either of the following:

To discover failure modes

To predict the normal field life from the highstresslab life

AnAccelerated testingprogram can be broken down into the following steps:

Define objective and scope of the test

Collect required information about the product

Identify the stress(es)

Determine level of stress(es)

Conduct the accelerated test and analyze the collected data.

Common way to determine a life stress relationship are

Arrhenius model

Eyring model

Inverse power law model

Temperaturehumidity model

Temperature non-thermal model

Software reliability[edit]Further information:Software reliabilitySoftware reliability is a special aspect of reliability engineering. System reliability, by definition, includes all parts of the system, including hardware, software, supporting infrastructure (including critical external interfaces), operators and procedures. Traditionally, reliability engineering focuses on critical hardware parts of the system. Since the widespread use of digitalintegrated circuittechnology, software has become an increasingly critical part of most electronics and, hence, nearly all present day systems.

There are significant differences, however, in how software and hardware behave. Most hardware unreliability is the result of a component or material failure that results in the system not performing its intended function. Repairing or replacing the hardware component restores the system to its original operating state. However, software does not fail in the same sense that hardware fails. Instead, software unreliability is the result of unanticipated results of software operations. Even relatively small software programs can have astronomically largecombinationsof inputs and states that are infeasible to exhaustively test. Restoring software to its original state only works until the same combination of inputs and states results in the same unintended result. Software reliability engineering must take this into account.

Despite this difference in the source of failure between software and hardware, severalsoftware reliability modelsbased on statistics have been proposed to quantify what we experience with software: the longer software is run, the higher the probability that it will eventually be used in an untested manner and exhibit a latent defect that results in a failure (Shooman1987), (Musa 2005), (Denney 2005).

As with hardware, software reliability depends on good requirements, design and implementation. Software reliability engineering relies heavily on a disciplinedsoftware engineeringprocess to anticipate and design againstunintended consequences. There is more overlap between softwarequality engineeringand software reliability engineering than between hardware quality and reliability. A good software development plan is a key aspect of the software reliability program. The software development plan describes the design and coding standards,peer reviews,unit tests,configuration management,software metricsand software models to be used during software development.

A common reliability metric is the number of software faults, usually expressed as faults per thousand lines of code. This metric, along with software execution time, is key to most software reliability models and estimates. The theory is that the software reliability increases as the number of faults (or fault density) decreases or goes down. Establishing a direct connection between fault density and mean-time-between-failure is difficult, however, because of the way software faults are distributed in the code, their severity, and the probability of the combination of inputs necessary to encounter the fault. Nevertheless, fault density serves as a useful indicator for the reliability engineer. Other software metrics, such as complexity, are also used. This metric remains controversial, since changes in software development and verification practices can have dramatic impact on overall defect rates.Testing is even more important for software than hardware. Even the best software development process results in some software faults that are nearly undetectable until tested. As with hardware, software is tested at several levels, starting with individualunits,throughintegratioandfull-up system testing. Unlike hardware, it is inadvisable to skip levels of software testing. During all phases of testing, software faults are discovered, corrected, and re-tested. Reliability estimates are updated based on the fault density and other metrics. At a system level, mean-time-between-failure data can be collected and used to estimate reliability. Unlike hardware, performing exactly the same test on exactly the same software configuration does not provide increased statistical confidence. Instead, software reliability uses different metrics, such ascode coverage.

Eventually, the software is integrated with the hardware in the top-level system, and software reliability is subsumed by system reliability. The Software Engineering Institute'scapability maturity modelis a common means of assessing the overall software development process for reliability and quality purposes.

Reliability engineering vs safety engineering[edit]Reliability engineering differs fromsafety engineeringwith respect to the kind of hazards that are considered. Reliability engineering is in the end only concerned with cost. It relates to all Reliability hazards that could transform into incidents with a particular level of loss of revenue for the company or the customer. These can be cost due to loss of production due to system unavailability, unexpected high or low demands for spares, repair costs, man hours, (multiple) re-designs, interruptions on normal production (e.g. due to high repair times or due to unexpected demands for non-stocked spares) and many other indirect costs.[21]Safety engineering, on the other hand, is more specific and regulated. It relates to only very specific and system safety hazards that could potentially lead to severe accidents and is primarily concerned with loss of life, loss of equipment, or environmental damage. The related system functional reliability requirements are sometimes extremely high. It deals with unwanted dangerous events (for life, property, and environment) in the same sense as reliability engineering, but does normally not directly look at cost and is not concerned with repair actions after failure / accidents (on system level). Another difference is the level of impact of failures on society and the control of governments. Safety engineering is often strictly controlled by governments (e.g. nuclear, aerospace, defense, rail and oil industries).[21]Furthermore, safety engineering and reliability engineering may even have contradicting requirements. This relates to system level architecture choices .[citation needed]For example, in train signal control systems it is common practice to use a fail-safe system design concept. In this concept theWrong-side failureneed to be fully controlled to an extreme low failure rate. These failures are related to possible severe effects, like frontal collisions (2* GREEN lights). Systems are designed in a way that the far majority of failures will simply result in a temporary or total loss of signals or open contacts of relays and generate RED lights for all trains. This is the safe state. All trains are stopped immediately. This fail-safe logic might unfortunately lower the reliability of the system. The reason for this is the higher risk of false tripping as any full or temporary, intermittent failure is quickly latched in a shut-down (safe)state. Different solutions are available for this issue. See chapter Fault Tolerance below.

Fault tolerance[edit]Reliability can be increased here by using a 2oo2 (2 out of 2) redundancy on part or system level, but this does in turn lower the safety levels (more possibilities for Wrong Side and undetected dangerous Failures). Fault tolerant voting systems (e.g. 2oo3 voting logic) can increase both reliability and safety on a system level. In this case the so-called "operational" or "mission" reliability as well as the safety of a system can be increased. This is also common practice in Aerospace systems that need continued availability and donothave a fail safe mode (e.g. flight computers and related electrical and / or mechanical and / or hydraulic steering functions need always to be working. There are no safe fixed positions for rudder or other steering parts when the aircraft is flying).

Basic reliability and mission (operational) reliability[edit]The above example of a 2oo3 fault tolerant system increases both mission reliability as well as safety. However, the "basic" reliability of the system will in this case still be lower than a non redundant (1oo1) or 2oo2 system! Basic reliability refers to all failures, including those that might not result in system failure, but do result in maintenance repair actions, logistic cost, use of spares, etc. For example, the replacement or repair of 1 channel in a 2oo3 voting system that is still operating with one failed channel (which in this state actually has become a 1oo2 system) is contributing to basic unreliability but not mission unreliability. Also, for example, the failure of the taillight of an aircraft is not considered as a mission loss failure, but does contribute to the basic unreliability.

Detectability and common cause failures[edit]When using fault tolerant (redundant architectures) systems or systems that are equipped with protection functions, detectability of failures and avoidance of common cause failures become paramount for safe functioning and/or mission reliability.

Reliability versus Quality (Six Sigma)[edit]Six-Sigma has its roots in manufacturing and Reliability engineering is more related to systems engineering. The systems engineering process is a discovery process that is quite unlike a manufacturing process. A manufacturing process is focused on repetitive activities that achieve high quality outputs with minimum cost and time. The systems engineering process must begin by discovering the real (potential) problem that needs to be solved; the biggest failure that can be made in systems engineering is finding an elegant solution to the wrong problem[22](or in terms of reliability: "providing elegant solutions to the wrong root causes of system failures").

The everyday usage term "quality of a product" is loosely taken to mean its inherent degree of excellence. In industry, this is made more precise by defining quality to be "conformance to requirements at the start of use".Assumingthe product specifications adequately capture customer (or rest of system) needs, the quality level of these parts can now be precisely measured by the fraction of units shipped that meet the detailedproductspecifications.[23]But are (derived, lower level) requirements and related product specifications validated? Will it later result in worn items and systems, by general wear, fatigue or corrosion mechanisms, debris accumulation or due to maintenance induced failures? Are there interactions on any system level (as investigated by for example Fault Tree Analysis)? How many of these systems still meet function and fulfill the needs after a week of operation? What performance losses occurred? Did full system failure occur? What happens after the end of a one year warranty period? And what after 50 years (typical high lifetime of Aircraft, Trains, Nuclear Systems, etc...)? That is where "reliability" comes in.

Quality is a snapshot at the start of life and mainly related to control of lower levelproductspecifications and reliability is (as part of systems engineering) more of a system level motion picture of the day-by-day operation for many years. Time zero defects are manufacturing mistakes that escaped final test (Quality Control). The additional defects that appear over time are "reliability defects" or reliability fallout. These reliability issues may just as well occur due to Inherent design issues, which may have nothing to do with non-conformance product specifications. Items that are produced perfectly - according all product specifications - may fail over time due to any single or combined failure mechanism (e.g. mechanical-, electrical-, chemical- or human error related). All these parameters arealsoa function of all all possible variances coming from initial production. Theoretically,all itemswill functionally fail over infinite time.[24]In theory the Quality level might be described by a single fraction defective. To describe reliability fallout a probability model that describes the fraction fallout over time is needed. This is known as the life distribution model.[23]Quality is therefore related to Manufacturing and Reliability is more related to the validation of sub-system or lower item requirements, (System or Part)inherent Design and life cycle solutions. Items that do not conform to (any) product specification in general will do worse in terms of reliability (having a lower MTTF), but this does not always have to be the case. The full mathematical Quantification (in statistical models) of this combined relation is in general very difficult or even practical impossible.In casemanufacturing variances can be effectively reduced, six sigma tools may be used to find optimal process solutions and may thereby also increase reliability. Six Sigma may also help to design more robust related to manufacturing induced failures.

In contrast with Six Sigma, Reliability Engineering Solutions are generally found by having a focus into a (system)design and not on the manufacturing process. Solutions are found in different ways, for example by simplifying a system and therefore understanding more mechanisms of failure involved, detailed calculation of material stress levels and required safety factors, finding possible abnormal system load conditions and next to this also to increase design robustness against variation from the manufacturing variances and related failure mechanisms. Furthermore reliability engineering use system level solutions, like designing redundancy and fault tolerant systems in case of high availability needs (see chapter Reliability engineering vs Safety engineering above).

Next to this and also in a major contrast with Reliability Engineering, Six-Sigma is much more measurement based (quantification). The core of Six-Sigma thrives on empirical research and statistics where it is possible to measure parameters (e.g. to find transfer functions). This can not be translated practically to most reliability issues, as reliability is not (easy) measurable due to the function of time (large times may be involved), specially during the requirements specification and design phase where reliability engineering is the most efficient. Full Quantification of reliability is in this phase extremely difficult or costly (testing). It also may foster re-active management (waiting for system failures to be measured). Furthermore, as explained on this page, Reliability problems are likely to come from many different (e.g. inherent failures, human error, systematic failures) causes besides manufacturing induced defects.

Quality (manufacturing) Six Sigma and Reliability (design) departments should provide input to each other to cover the complete risks more efficiently.

Reliability operational assessment[edit]After a system is produced, reliability engineering monitors, assesses and corrects deficiencies. Monitoring includes electronic and visual surveillance of critical parameters identified during the fault tree analysis design stage. Data collection is highly dependent on the nature of the system. Most large organizations havequality controlgroups that collect failure data on vehicles, equipment and machinery. Consumer product failures are often tracked by the number of returns. For systems in dormant storage or on standby, it is necessary to establish a formal surveillance program to inspect and test random samples. Any changes to the system, such as field upgrades or recall repairs, require additional reliability testing to ensure the reliability of the modification. Since it is not possible to anticipate all the failure modes of a given system, especially ones with a human element, failures will occur. The reliability program also includes a systematicroot cause analysisthat identifies the causal relationships involved in the failure such that effective corrective actions may be implemented. When possible, system failures and corrective actions are reported to the reliability engineering organization.

One of the most common methods to apply to a reliability operational assessment arefailure reporting, analysis, and corrective action systems(FRACAS). This systematic approach develops a reliability, safety and logistics assessment based on Failure / Incident reporting, management, analysis and corrective/preventive actions. Organizations today are adopting this method and utilize commercial systems such as a Web based FRACAS application enabling an organization to create a failure/incident data repository from which statistics can be derived to view accurate and genuine reliability, safety and quality performances.

It is extremely important to have one common source FRACAS system for all end items. Also, test results should be able to be captured here in a practical way. Failure to adopt one easy to handle (easy data entry for field engineers and repair shop engineers)and maintain integrated system is likely to result in a FRACAS program failure.

Some of the common outputs from a FRACAS system includes: Field MTBF, MTTR, Spares Consumption, Reliability Growth, Failure/Incidents distribution by type, location, part no., serial no, symptom etc.

The use of past data to predict the reliability of new comparable systems/items can be misleading as reliability is a function of the context of use and can be affected by small changes in the designs/manufacturing.

Reliability organizations[edit]Systems of any significant complexity are developed by organizations of people, such as a commercialcompanyor agovernmentagency. The reliability engineering organization must be consistent with the company'sorganizational structure. For small, non-critical systems, reliability engineering may be informal. As complexity grows, the need arises for a formal reliability function. Because reliability is important to the customer, the customer may even specify certain aspects of the reliability organization.

There are several common types of reliability organizations. The project manager or chief engineer may employ one or more reliability engineers directly. In larger organizations, there is usually a product assurance orspecialty engineeringorganization, which may include reliability,maintainability,quality, safety,human factors,logistics, etc. In such case, the reliability engineer reports to the product assurance manager or specialty engineering manager.

In some cases, a company may wish to establish an independent reliability organization. This is desirable to ensure that the system reliability, which is often expensive and time consuming, is not unduly slighted due to budget and schedule pressures. In such cases, the reliability engineer works for the project day-to-day, but is actually employed and paid by a separate organization within the company.

Because reliability engineering is critical to early system design, it has become common for reliability engineers, however the organization is structured, to work as part of anintegrated product team.

Reliability engineering education[edit]Some universities offer graduate degrees in reliability engineering. Other reliability engineers typically have an engineering degree, which can be in any field of engineering, from an accredited university or college program. Many engineering programs offer reliability courses, and some universities have entire reliability engineering programs. A reliability engineer may be registered as a professional engineer by the state, but this is not required by most employers. There are many professional conferences and industry training programs available for reliability engineers. Several professional organizations exist for reliability engineers, including theIEEE Reliability Society, theAmerican Society for Quality (ASQ), and theSociety of Reliability Engineers (SRE).What is MIL-HDBK-217?The original reliability prediction handbook was MIL-HDBK-217, the Military Handbook for "Reliability Prediction of Electronic Equipment". MIL-HDBK-217 is published by the Department of Defense, based on work done by the Reliability Analysis Center and Rome Laboratory at Griffiss AFB, NY.

The MIL-HDBK-217 handbook contains failure rate models for the various part types used in electronic systems, such as ICs, transistors, diodes, resistors, capacitors, relays, switches, connectors, etc. These failure rate models are based on the best field data that could be obtained for a wide variety of parts and systems; this data is then analyzed and massaged, with many simplifying assumptions thrown in, to create usable models.

The latest version of MIL-HDBK-217 is MIL-HDBK-217F, Notice 2 (217F2). You can get a copy of MIL-HDBK-217F2 from any source that provides Mil Specs, Mil Standards, Mil Handbooks, etc. Try the Defense Printing Service, Philadelphia, PA, Phone: (215) 697-2179, Fax: (215) 697-1462.

Documents

Reliability