Improvement of System Reliability and Failure Avoidance


A project work report submitted

to

For partial fulfillment of the requirement for the

Award of the degree

of

In

By

Under the guidance of



ACKNOWLEDGEMENT

I would like to take the opportunity to extend my sincere

thanks and gratitude to Dr. S. B. Prasad, our project

supervisor for providing his assistance and co-ordination

during the development of the report.

I am thankful to Dr. A. M. Tigga, Head of Department, Production and Industrial Engg., for his constant

encouragement and valuable suggestions throughout this

project.

Finally we are grateful to all the faculty of the Department

of Production and Industrial Engg., N.I.T. Jamshedpur for

their encouragement and inspiration.

By

SUMIT KUMAR JHA (308/06)

SYED SARIM HUSSAIN (322/06)

ROHIT SURIN (251/06)



CONTENTS

CERTIFICATE

ACKNOWLEDGEMENT

ABSTRACT

CHAPTER 1 INTRODUCTION

1.1 MOTIVATION: RELIABILITY AND SYSTEMS ENGG.

CHAPTER 2 LITERATURE REVIEW

2.1 RELIABILITY THEORY

CHAPTER 3 REVIEW OF RELATED WORK

CHAPTER 4 FOUR STRATEGIES FOR IMPROVED ROBUSTNESS

CHAPTER 5 SUMMARY

REFERENCES



ABSTRACT

To be reliable, a system must be robust: it must avoid failure modes even in the

presence of a broad range of conditions including harsh environments, changing

operational demands, and internal deterioration. This project discusses and

codifies techniques for robust system design that operate by expanding the range

of conditions under which the system functions.

A distinction is introduced between one-sided and two-sided failure modes, and

four strategies are presented for creating larger windows between sets of one-

sided failure modes. Each strategy is illustrated through two examples from

industrial practice. For each strategy, one example is from paper handling and

another is from jet engines. By showing that every strategy has been successfully

applied to each system, we seek to illustrate that the strategies are widely

applicable and highly effective.

Key words: reliability; robust design; operating window; system architecture



INTRODUCTION

Reliability may be defined in several ways:

* The idea that something is fit for a purpose with respect to time;

* The capacity of a device or system to perform as designed;

* The resistance to failure of a device or system;

* The ability of a device or system to perform a required function under

stated conditions for a specified period of time;

* The probability that a functional unit will perform its required function

for a specified interval under stated conditions.

* The ability of something to "fail well" (fail without catastrophic

consequences)

MOTIVATION: RELIABILITY AND SYSTEMS ENGINEERING

Reliability engineers rely heavily on statistics, probability theory, and reliability

theory. Many engineering techniques are used in reliability engineering, such

as reliability prediction, Weibull analysis, thermal management, reliability

testing and accelerated life testing. Because of the large number of reliability

techniques, their expense, and the varying degrees of reliability required for

different situations, most projects develop a reliability program plan to specify

the reliability tasks that will be performed for that specific system.

The function of reliability engineering is to develop the reliability requirements

for the product, establish an adequate reliability program, and perform

appropriate analyses and tasks to ensure the product will meet its

requirements. These tasks are managed by a reliability engineer, who usually

holds an accredited engineering degree and has additional reliability-specific



education and training. Reliability engineering is closely associated with

maintainability engineering and logistics engineering. Many problems from

other fields, such as security engineering, can also be approached using

reliability engineering techniques. This report provides an overview of some of

the most common reliability engineering tasks. Please see the references for a

more comprehensive treatment.

Reliability is among the most important topics in systems engineering.

Reliability is the proper functioning of the system under the full range of

conditions experienced in the field. Reliability requires two critical conditions:

* Mistake avoidance

* Robustness

By mistake we refer to the plethora of design decisions and

manufacturing operations that may be grossly in error. Examples of mistakes

are installing a switch backwards, or interpreting a software command as being

expressed in inches when it represents centimeters. Reliability can be

improved by reducing the incidence of such mistakes by a combination of

knowledge-based engineering and the problem-solving process.

By robustness we refer to the ability of a system to function (i.e., to avoid

failure) under the full range of conditions that may be experienced in the field. It

is one sort of challenge to develop a system that functions for a demonstration

under tightly controlled conditions such as in a laboratory. It is an entirely

different challenge to make a system that functions reliably throughout its

lifecycle as it experiences a broad set of real world environmental and operating

conditions. Effective systems engineering addresses the second challenge, not the

first.

Many types of engineering employ reliability engineers and use the tools and

methodology of reliability engineering. For example:

* System engineers design complex systems having a specified reliability



* Mechanical engineers may have to design a machine or system with a specified

reliability

* Automotive engineers have reliability requirements for the automobiles (and

components) which they design

* Electronics engineers must design and test their products for reliability

requirements.

* In software engineering and systems engineering, reliability engineering is

the subdiscipline of ensuring that a system (or a device in general) will perform its

intended function(s) when operated in a specified manner for a specified length

of time. Reliability engineering is performed throughout the entire life cycle of a

system, including development, test, production and operation.

An alternative conception of reliability engineering is based on what we call

Failure-mode avoidance. Many changes in system design that improve reliability

do so by moving the physical failure modes. In fact, we argue that the most

significant improvements in reliability come about by this means. Although this

approach can be integrated with probability theory, it is not necessary to use

probability theory to understand how these design changes bring about their

effects.

We claim that, especially in the early development of systems, the Failure-mode

avoidance approach will lead to many improvements being made with a minimum

amount of data required, just enough to guide the next improvement. The

Failure-mode avoidance approach is deeply rooted in the physics of the system

and is therefore tangible to the engineers, which facilitates the needed creative

insights for concept design. This advantage is supported by recent results from

cognitive psychology.

A further advantage of the Failure-mode avoidance approach is that it reduces

the salience of so-called specified operating conditions. At an early stage of

system development, one cannot reasonably define a complete set of conditions

that a system is likely to experience in its lifecycle.



Although an approximate set of conditions can be defined, it will surely miss some

important combinations of conditions. Later on, these unanticipated operating

conditions may arise and the system may cease to function. When this happens, it

is tempting to say that, since the condition was not specified, the system did not

actually fail, but rather that the system was misused. It is essential for systems engineers to

recognize that nature does not care what systems engineers think the specified

operating conditions are. When the system fails to function under the conditions

the system actually experiences, that constitutes a failure. This point is well

understood by some reliability engineers. For example, Thomas, Ayers, and Pecht

[2002] discuss "trouble not identified" warranty returns in the auto industry and

conclude: "It must not be assumed that a returned module that passes tests

associated with an engineering specification is good." Because of uncertainty

regarding specified operating conditions, we argue that an effective approach is

to increase the set of conditions under which the system operates and do this as

quickly and economically as one can manage within the time available. This

implies that systems engineers should not spend much energy on predicting field

reliability but instead use that same energy to increase field reliability [Clausing,

1994]. It seems that the creative design work that leads to reliability improvement

is a very natural activity and is consistent with our failure-mode avoidance



conception of reliability. We propose that thinking of reliability as failure-mode

avoidance can have real advantages, especially in the early stages of system

design or in a long-term scenario such as technology development. In early stages

of system design, probability theory may be too quantitative for the task at hand.

Probability density functions imply a level of precision in modeling the scenario

that is often unwarranted, especially during early development. As a project

advances through its development stages the probabilistic view of reliability

becomes increasingly useful. Analysis of reliability using probability theory is

useful for component selection, system validation, and the management of field-

service operations. The value of the failure mode avoidance conception of

reliability is greatest for technology strategy, systems architecting, concept

design, and for some robust parameter design activities, all done early during the

development of the system.



2. RELIABILITY THEORY

Reliability theory is the foundation of reliability engineering. For engineering

purposes, reliability is defined as:

the probability that a device will perform its intended function during a

specified period of time under stated conditions.

Mathematically, this may be expressed as

$$R(t) = \Pr\{T > t\} = \int_t^{\infty} f(x)\,dx,$$

where $T$ is the time to failure, $f(x)$ is the failure probability density function, and $t$ is the length of

the period of time (which is assumed to start from time zero).
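
As a concrete illustration of this definition, the following minimal sketch evaluates R(t) for an exponential failure density f(x) = λ·exp(-λx), for which the integral has the closed form R(t) = exp(-λt). The exponential model, the function names, and the numbers are assumptions chosen for illustration; they are not taken from this report.

```python
import math

def reliability_exponential(t, failure_rate):
    """Closed-form R(t) = exp(-lambda * t) for an exponential failure density."""
    return math.exp(-failure_rate * t)

def reliability_numeric(t, pdf, upper, steps=100_000):
    """Approximate R(t), the integral of pdf from t to `upper`, by the trapezoid rule.

    `upper` stands in for infinity and must be large enough that the
    remaining tail of the density is negligible.
    """
    h = (upper - t) / steps
    total = 0.5 * (pdf(t) + pdf(upper))
    for i in range(1, steps):
        total += pdf(t + i * h)
    return total * h

FAILURE_RATE = 1e-3  # assumed: on average one failure per 1000 hours

def exponential_pdf(x):
    return FAILURE_RATE * math.exp(-FAILURE_RATE * x)

r_closed = reliability_exponential(500.0, FAILURE_RATE)
r_numeric = reliability_numeric(500.0, exponential_pdf, upper=50_000.0)
print(round(r_closed, 4), round(r_numeric, 4))
```

Both routes give the probability of surviving the first 500 hours; the numerical version works for any density, including ones with no closed-form integral.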

Reliability engineering is concerned with four key elements of this definition:

First, reliability is a probability. This means that failure is regarded as

a random phenomenon: it is a recurring event, and we do not

express any information on individual failures, the causes of failures,

or relationships between failures, except that the likelihood for

failures to occur varies over time according to the given probability

function. Reliability engineering is concerned with meeting the

specified probability of success, at a specified statistical confidence

level.

Second, reliability is predicated on "intended function". Generally,

this is taken to mean operation without failure. However, even if no

individual part of the system fails, but the system as a whole does

not do what was intended, then it is still charged against the system

reliability. The system requirements specification is the criterion

against which reliability is measured.

Third, reliability applies to a specified period of time. In practical

terms, this means that a system has a specified chance that it will

operate without failure before time t. Reliability engineering ensures

that components and materials will meet the requirements during

the specified time. Units other than time may sometimes be used.

The automotive industry might specify reliability in terms of miles,

the military might specify reliability of a gun for a certain number of

rounds fired. A piece of mechanical equipment may have a reliability

rating value in terms of cycles of use.

Fourth, reliability is restricted to operation under stated conditions.

This constraint is necessary because it is impossible to design a

system for unlimited conditions. A Mars Rover will have different

specified conditions than the family car. The operating environment

must be addressed during design and testing. Also, that same rover

may be required to operate in varying conditions requiring additional

scrutiny.

Reliability program plan

Many tasks, methods, and tools can be used to achieve reliability. Every system

requires a different level of reliability. A commercial airliner must operate under a

wide range of conditions. The consequences of failure are grave, but there is a

correspondingly higher budget. A pencil sharpener may be more reliable than an

airliner, but has a much different set of operational conditions, insignificant

consequences of failure, and a much lower budget.

A reliability program plan is used to document exactly what tasks, methods, tools,

analyses, and tests are required for a particular system. For complex systems, the

reliability program plan is a separate document. For simple systems, it may be

combined with the systems engineering management plan or Integrated Logistics

Support management plan. The reliability program plan is essential for a

successful reliability program and is developed early during system development.

It specifies not only what the reliability engineer does, but also the tasks

performed by others. The reliability program plan is approved by top program

management.

Reliability requirements

For any system, one of the first tasks of reliability engineering is to adequately

specify the reliability requirements. Reliability requirements address the system

itself, test and assessment requirements, and associated tasks and

documentation. Reliability requirements are included in the appropriate

system/subsystem requirements specifications, test plans, and contract

statements.

Design for reliability

Design For Reliability (DFR) is an emerging discipline that refers to the process of

designing reliability into products. This process encompasses several tools and

practices and describes the order of their deployment that an organization needs

to have in place to drive reliability into their products. Typically, the first step in

the DFR process is to set the system's reliability requirements. Reliability must be

"designed in" to the system. During system design, the top-level reliability

requirements are then allocated to subsystems by design engineers and reliability

engineers working together.

Reliability design begins with the development of a model. Reliability models use

block diagrams and fault trees to provide a graphical means of evaluating the

relationships between different parts of the system. These models incorporate

predictions based on parts-count failure rates taken from historical data. While

the predictions are often not accurate in an absolute sense, they are valuable to

assess relative differences in design alternatives.
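
A parts-count prediction of the kind mentioned above can be sketched as follows, assuming a series system (any part failure fails the system) with constant failure rates, so that the system failure rate is the sum of the part failure rates. The part names, quantities, and rates are invented for illustration and are not historical data.

```python
import math

def parts_count_failure_rate(parts):
    """Sum quantity * failure rate over all parts (series system, constant rates)."""
    return sum(quantity * rate for quantity, rate in parts.values())

def series_reliability(failure_rate, hours):
    """R(t) = exp(-lambda * t) for a constant system failure rate."""
    return math.exp(-failure_rate * hours)

# Hypothetical design alternatives: {part: (quantity, failures per hour)}
design_a = {"resistor": (40, 1e-8), "capacitor": (20, 5e-8), "connector": (4, 2e-7)}
design_b = {"resistor": (40, 1e-8), "capacitor": (20, 5e-8), "connector": (2, 2e-7)}

rate_a = parts_count_failure_rate(design_a)
rate_b = parts_count_failure_rate(design_b)
mission_hours = 10_000  # assumed mission length

print(f"A: rate={rate_a:.2e}/h, R={series_reliability(rate_a, mission_hours):.4f}")
print(f"B: rate={rate_b:.2e}/h, R={series_reliability(rate_b, mission_hours):.4f}")
```

The absolute numbers are only as good as the assumed rates, but the comparison still ranks design B (fewer connectors) ahead of design A, which is the relative use of such predictions that the text describes.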

[Figure: a fault tree diagram]

One of the most important design techniques is redundancy. This means that if

one part of the system fails, there is an alternate success path, such as a backup

system. An automobile brake light might use two light bulbs. If one bulb fails, the

brake light still operates using the other bulb. Redundancy significantly increases

system reliability, and is often the only viable means of doing so. However,

redundancy is difficult and expensive, and is therefore limited to critical parts of

the system. Another design technique, physics of failure, relies on understanding

the physical processes of stress, strength and failure at a very detailed level. Then

the material or component can be re-designed to reduce the probability of

failure. Another common design technique is component derating: selecting

components whose tolerance significantly exceeds the expected stress, such as using a

heavier gauge wire that exceeds the normal specification for the expected

electrical current.
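
The effect of redundancy described above can be quantified with a short sketch. It assumes identical units that fail independently, which ignores common-cause failures such as a shared fuse; the 0.95 bulb reliability is an invented figure.

```python
def parallel_reliability(unit_reliability, n_units):
    """Probability that at least one of n independent, identical units works."""
    return 1.0 - (1.0 - unit_reliability) ** n_units

single_bulb = parallel_reliability(0.95, 1)  # assumed bulb reliability of 0.95
dual_bulb = parallel_reliability(0.95, 2)    # two-bulb brake light

print(single_bulb, dual_bulb)
```

Under these assumptions the second bulb cuts the probability of a dark brake light from 5% to 0.25%, a twentyfold reduction bought with one extra component.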

Many tasks, techniques and analyses are specific to particular industries and

applications. Commonly these include:

* Built-in test (BIT)

* Failure mode and effects analysis (FMEA)

* Reliability simulation modeling

* Thermal analysis

* Reliability block diagram analysis

* Fault tree analysis

* Root cause analysis

* Sneak circuit analysis

* Accelerated testing

* Reliability growth analysis

* Weibull analysis

* Electromagnetic analysis

* Statistical interference

* Avoidance of single points of failure

Results are presented during the system design reviews and logistics reviews.

Reliability is just one requirement among many system requirements. Engineering

trade studies are used to determine the optimum balance between reliability and

other requirements and constraints.

Reliability testing

The purpose of reliability testing is to discover potential problems with the design

as early as possible and, ultimately, provide confidence that the system meets its

reliability requirements.

Reliability testing may be performed at several levels. Complex systems may be

tested at component, circuit board, unit, assembly, subsystem and system levels.

(The test level nomenclature varies among applications.) For example, performing

environmental stress screening tests at lower levels, such as piece parts or small

assemblies, catches problems before they cause failures at higher levels. Testing

proceeds during each level of integration through full-up system testing,

developmental testing, and operational testing, thereby reducing program risk.

System reliability is calculated at each test level. Reliability growth techniques and

failure reporting, analysis and corrective action systems (FRACAS) are often

employed to improve reliability as testing progresses. The drawbacks to such

extensive testing are time and expense. Customers may choose to accept more

risk by eliminating some or all lower levels of testing.

It is not always feasible to test all system requirements. Some systems are

prohibitively expensive to test; some failure modes may take years to observe;

some complex interactions result in a huge number of possible test cases; and

some tests require the use of limited test ranges or other resources. In such cases,

different approaches to testing can be used, such as accelerated life testing,

design of experiments, and simulations.

The desired level of statistical confidence also plays an important role in reliability

testing. Statistical confidence is increased by increasing either the test time or the

number of items tested. Reliability test plans are designed to achieve the

specified reliability at the specified confidence level with the minimum number of

test units and test time. Different test plans result in different levels of risk to the

producer and consumer. The desired reliability, statistical confidence, and risk

levels for each side influence the ultimate test plan. Good test requirements

ensure that the customer and developer agree in advance on how reliability

requirements will be tested.
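
The link between confidence, reliability, and the number of test units can be illustrated with the simplest kind of test plan: a pass/fail demonstration in which every unit must survive. If each unit independently survives with probability R, then n survivals out of n demonstrate reliability R at confidence C once R^n <= 1 - C. The zero-failure binomial plan and the function name are assumptions made for this sketch; real test plans often allow failures or use test time instead of unit counts.

```python
import math

def zero_failure_sample_size(reliability, confidence):
    """Smallest n with reliability**n <= 1 - confidence.

    Assumes independent pass/fail trials and a plan that tolerates no failures.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

for reliability, confidence in [(0.90, 0.90), (0.95, 0.90), (0.90, 0.95)]:
    n = zero_failure_sample_size(reliability, confidence)
    print(f"R={reliability}, C={confidence}: test {n} units with zero failures")
```

Raising either the reliability target or the confidence level increases the required sample, which is why the two parties must agree on both before testing begins.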

A key aspect of reliability testing is to define "failure". Although this may seem

obvious, there are many situations where it is not clear whether a failure is really

the fault of the system. Variations in test conditions, operator differences,

weather, and unexpected situations create differences between the customer and

the system developer. One strategy to address this issue is to use a scoring

conference process. A scoring conference includes representatives from the

customer, the developer, the test organization, the reliability organization, and

sometimes independent observers. The scoring conference process is defined in

the statement of work. Each test case is considered by the group and "scored" as

a success or failure. This scoring is the official result used by the reliability

engineer.

As part of the requirements phase, the reliability engineer develops a test

strategy with the customer. The test strategy makes trade-offs between the

needs of the reliability organization, which wants as much data as possible, and

constraints such as cost, schedule, and available resources. Test plans and

procedures are developed for each reliability test, and results are documented in

official reports.

3. REVIEW OF RELATED WORK

This project is intended to help engineers with the early-stage, conceptual phase

of design. Therefore, an important related development is the Theory of Inventive

Problem Solving (sometimes described by the acronyms TRIZ or TIPS). The theory

was first described by Altschuller [1984] and was recently placed in a broader

context of innovation by Clausing and Fey [2004]. The theory is based on a study

of thousands of patents that revealed patterns among inventive solutions. An

important underlying hypothesis is that inventive problems can be viewed as

conflicts which the inventive solutions resolve. This enabled large numbers of

patents to be organized in a useful taxonomy. It has also given rise to commercial

software products that facilitate the use of the theory by professional

practitioners. However, we note that many patents claim robustness as their

primary advantage: they do not deliver new functions, but deliver existing

functions over a broader range of conditions. While TRIZ is helpful in

development of new functions and elimination of harmful side effects, it does not

seem to support reliability innovations to the extent we desire. Therefore, this

paper analyzes patents and seeks new patterns of inventive engineering work.

A development in reliability engineering closely related to this project is the

physics-of-failure (PoF) approach developed at the Computer Aided Life Cycle

Engineering (CALCE) Electronic Products and Systems Center at the University of

Maryland. The first instance in archival literature of the term "physics of failure" is

Pecht et al. [1990], which emphasizes use of a physics-based model for reliability

prediction and design for reliability. This approach has been extended to product

development by Pecht and Desgupta [1995] and to accelerated life testing by

Kimseng et al. [1999]. This paper builds upon the conception of physics of failure

and seeks to extend this conception to the earliest, creative phases of system

design.



An important development in reliability engineering is robust parameter design

pioneered by Genichi Taguchi [Taguchi, 1993]. For any design concept, there is a

potentially large space of control factor settings that will nominally place the

function at the desired target value. In robust parameter design, the engineer

explores the design space seeking changes that will make the system more robust

while still keeping the performance on target. Taguchi's method employs

orthogonal arrays to explore the design space. At the same time, outer arrays or

compounded noises are used to explore the range of possible operating

conditions. Signal to noise ratios are used as measures of the robustness of the

system and guide the engineer to preferable levels of the control factors.
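
To show how a signal to noise ratio ranks control factor settings, the sketch below uses the common "nominal-the-best" form, 10·log10(mean² / variance), which is only one of several ratios used in robust design. The two sets of observations are invented: each represents the same response measured under three noise conditions at one control factor setting.

```python
import math
import statistics

def sn_nominal_the_best(observations):
    """Nominal-the-best signal to noise ratio in decibels: 10*log10(mean^2 / variance)."""
    mean = statistics.mean(observations)
    variance = statistics.variance(observations)  # sample variance
    return 10.0 * math.log10(mean ** 2 / variance)

# Hypothetical responses under three noise conditions; both settings are on target (mean 10).
setting_1 = [9.0, 10.0, 11.0]   # sample variance 1.0
setting_2 = [9.5, 10.0, 10.5]   # sample variance 0.25

sn_1 = sn_nominal_the_best(setting_1)
sn_2 = sn_nominal_the_best(setting_2)
print(f"setting 1: {sn_1:.2f} dB, setting 2: {sn_2:.2f} dB")
```

Both settings hit the target on average, but setting 2 halves the spread caused by the noise conditions and therefore scores about 6 dB higher, so it would be the preferred level.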

Taguchi's philosophy of robust design is consistent with the approach to reliability

engineering discussed here. Taguchi rejected the "goal post" mentality inherent in

tolerance limits and specifications. His notion of a quality-loss function replaced

consideration of defect rates and process yields with an emphasis on reducing

variance followed by adjustment to target. Taguchi encouraged engineers to

deliberately expose designs to harsh conditions in experiments. To do this

requires a transformation in the culture of an engineering organization. The

emphasis must shift from demonstrating adequate performance with high

statistical confidence to aggressive improvement followed by adequate

confirmation.

Robust parameter design is among the most important developments in systems

engineering in the 20th century. These methods seem to have accounted for a

significant part of the quality differential that made Japanese manufacturing so

dominant during the 1970s. The methods were subsequently adopted outside of

Japan. The timing of that adoption in the West corresponded closely with

improvement in quality that improved competitiveness of North American and

European manufacturers. Robust design methods were surely a significant part of

both the rise of Japanese industry and the response to that competitive

challenge. Robust design methods have continued to be refined and are still an

active area of systems engineering innovation.



Another approach relevant to this paper, known as operating window methods,

was developed and practiced at Xerox Corporation in the 1970s. The operating

window is the set of conditions under which the system operates without failure.

In operating window methods, reliability is improved by making the operating

window larger. Clausing [2004] described the approach in detail in a recent issue

of Technometrics, but the essence of the approach is simple enough to present

here:

1. Increase the value of the noise factors so that the failure rate is high.

2. Change the value of the control factors to seek a broader operating window at

a fixed failure rate.

This approach was used, for example, to improve the reliability of paper handling

machines. At Xerox, paper stacks were designed and constructed to deliberately

produce a large magnitude of variation. The papers varied in their weight, surface

condition, geometry, and so on. These paper stacks were similar to the worst

stacks one would encounter in field use, and, in conjunction with operation near

the limit of the operating window, they brought about higher failure rates than

would normally be encountered, on the order of 1 in 10 rather than 1 in 10,000.

These high failure rates enabled the engineers to more quickly discern the effect

of changes in failure rate with changes in the control factors such as stack forces,

feed belt angles, and so on. This approach worried managers since they observed

the machines jamming with high frequency, but they eventually came to

understand why this was needed. As a consequence the engineers were able to

quickly converge to more reliable machine configurations.
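
The benefit of testing at a deliberately high failure rate can be illustrated with a rough sample-size calculation. The sketch below assumes each sheet feed is an independent pass/fail trial (a binomial model, which is an assumption made here and is not stated in the source). It asks how many trials are needed to estimate a failure rate to within a chosen relative standard error. The function name `trials_needed` and the 25% target are hypothetical choices for illustration only.

```python
from fractions import Fraction
import math


def trials_needed(one_in_n: int, rel_std_error: Fraction) -> int:
    """Trials needed to estimate a failure rate of 1-in-`one_in_n`
    to a given relative standard error, under a binomial model.

    For n trials with failure probability p, the estimate failures/n has
    standard error sqrt(p * (1 - p) / n). Dividing by p and solving for n
    gives n = (1 - p) / (p * r**2), where r is the relative standard error.
    """
    p = Fraction(1, one_in_n)
    return math.ceil((1 - p) / (p * rel_std_error ** 2))


if __name__ == "__main__":
    target = Fraction(1, 4)  # hypothetical 25% relative standard error
    for one_in_n in (10, 10_000):
        # Exact rational arithmetic avoids floating-point rounding surprises.
        print(f"1 in {one_in_n}: {trials_needed(one_in_n, target)} trials")
```

Under these assumptions, a stressed test at 1 failure in 10 needs on the order of a hundred feeds per configuration, whereas a test at the field-like rate of 1 in 10,000 needs on the order of a hundred thousand, which is consistent with the faster convergence described above.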

Despite the use of failure rates as a measure of performance, the operating

window method is, upon closer examination, consistent with Taguchi's quality

philosophy. Because failure rates were greatly increased by applying aggressive noises, improvements could be made rapidly, even though they sacrificed the

ability to accurately predict field reliability. The term "operating window" may

seem to imply an emphasis on "goal posts," but in fact the customer-specified


limits are viewed as irrelevant and the expansion of actual physical limits is valued

instead.

Operating window methods continue to be an active area of research in quality

engineering. Joseph and Wu [2004] showed that under certain conditions a failure rate of 50% maximizes the information gained from robust design using an

operating window. As an example, they carried out a case study wherein line

width in a lithography process was set at a much finer pitch than actually needed in

practice. The control factor settings that improved the robustness at the finer

pitch also improved the robustness at the pitch needed in operation. The basic

concept of operating windows was therefore further corroborated.
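
One simple way to see why a failure rate near 50% can be the most informative is sketched below. This is an illustrative argument under an assumed pass/fail (Bernoulli) model, not necessarily the derivation used by Joseph and Wu; the symbols $Y$ (the pass/fail outcome of one trial) and $p$ (the failure probability) are introduced here for the illustration.

```latex
\[
  \operatorname{Var}(Y) = p\,(1-p),
  \qquad
  \frac{d}{dp}\,\bigl[\,p\,(1-p)\,\bigr] = 1 - 2p = 0
  \;\Longrightarrow\; p = \tfrac{1}{2}.
\]
% In a logistic model of failure probability versus a control factor,
% the Fisher information contributed by each trial is proportional to
% p(1-p), so each trial is most informative when p is near 1/2.
```

By the same expression, a trial run at a failure rate of 1 in 10,000 carries very little information about how a design change shifts the failure boundary.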

While retaining the benefits of Taguchi's quality philosophy, operating window

methods may have a further advantage. In operating window methods, the

progress in reliability is measured in physical terms by the size of the operating

window. This may be preferable to measuring results with a more abstract

measure such as signal to noise ratios. For example, operating window methods

encouraged engineers at Xerox to devise ways to double the range of paper

weights the machine could feed rather than contemplate how to increase signal

to noise ratios by 6 decibels. As previously discussed, cognitive psychology

suggests there is an advantage in maintaining a connection to physical quantities rather than probabilistic measures. We propose that a mental connection to the

physics and logic of the system is even more critical for early stage system design

than it is for later stage parameter design.
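
The correspondence between a doubling and roughly 6 decibels can be checked with a short calculation. The sketch assumes a signal-to-noise ratio of the form $10\log_{10}(\mu^2/\sigma^2)$, where $\mu$ is a mean response and $\sigma$ its standard deviation; this form and these symbols are assumptions made for the illustration.

```latex
\[
  \Delta(\mathrm{S/N})
  = 10\log_{10}\!\frac{(2\mu)^2}{\sigma^2} - 10\log_{10}\!\frac{\mu^2}{\sigma^2}
  = 10\log_{10} 4
  = 20\log_{10} 2
  \approx 6.02\ \mathrm{dB}.
\]
```

A factor-of-two improvement in the underlying ratio is therefore about 6 dB, but the physical statement (twice the range of paper weights) is the one that an engineer can act on directly.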

As discussed in this section, the basic concept of operating windows is to seek a

larger set of conditions under which the system functions. While the idea is very

simple, implementation is challenging, requiring deep knowledge of the system

and the creativity to develop the needed design innovations. This paper seeks to

help engineers implement early stage robustness work via operating window

methods. The next section covers some theoretical developments. The

subsequent sections present specific strategies for implementation.


4. FOUR STRATEGIES FOR IMPROVED ROBUSTNESS

Up to this point, this paper has focused on the interrelated concepts of reliability,

robustness, and one-sided failure modes. From this point forward, the paper

concentrates on strategies to avoid one-sided failure modes. All of these strategies involve concept design rather than parameter design. The design changes

considered here are not only changes in the values of design parameters but also

additions of new features or components, changes in the configuration of the

system, or even new inventions. We present four strategies along these lines:

1. Relax a constraint limit on an uncoupled control factor.

2. Use physics of incipient failure to avoid failure.

3. Create two distinct operating modes for two different demand conditions.

4. Exploit interdependence between two operating window system variables.

To illustrate these strategies and demonstrate their versatility, we present two

different example applications of each strategy, a primary example that is

described in considerable detail and a supplementary example that is described in

less detail. Two engineering domains are used throughout: paper feeders and jet

engines. The next four subsections present these strategies.

4.1. Relax a Constraint Limit on an Uncoupled Control Factor

A control factor that affects only one of the one-sided failure modes in a system is

said to be uncoupled as defined in Section 3. Such control factors should be

maximized or minimized to create the greatest possible distance from the

affected one-sided failure mode consistent with any constraints on the control

factor. As the system is placed under greater demands over time due to system

evolution and competition, the operating window afforded under the current system constraints may become insufficient. Under these circumstances, the

constraint can often be relaxed by making changes in the system architecture or

by changes in technology. The relaxed constraint enables further changes to the

uncoupled control factor, which opens the operating window.


Primary Case Study: Paper Feeder. As an industrial example, we present the

Xerox paper feeder that first went into production in 1981, and has appeared in

many different Xerox copiers and printers. This paper feeder is known as a

friction-retard feeder (Fig. 5).

The feedbelt rests on the paper stack, and drags the top sheet forward. The

friction of the retard roll holds back (retards) the second sheet if it tries to come

through. Thus, the retard roll prevents multifeeds (feeding of more than one

sheet). Therefore, the wrap angle between the feedbelt and the retard roll only

affects the failure mode of multifeeds. The other primary failure mode is misfeeds

(no sheet is fed). This failure mode is not affected by the wrap angle between the

feedbelt and the retard roll. Because multifeeds are reduced by a large wrap

angle and misfeeds are unaffected, it is clear that the wrap angle should be as big as possible.

Despite the desirability of having a large wrap angle, the previous-generation

feeder (ca. 1975) had a wrap angle of only 13°, which was constrained by the

system architecture. In the new design that first went into production in 1981 the

wrap angle was increased to 45°. This large improvement in wrap angle was

enabled by a change in the total system architecture. In large copiers and printers

the next subsystem after the paper feeder is the registration subsystem, which


aligns the sheet with the image. In the new design the architecture was changed

so that the paper came out of the feeder and turned down to reach the

registration subsystem (Fig. 6), which was underneath the feeder. This enabled the wrap angle to be greatly increased. This architecture also reduced the width of

the copier/printer, which is desirable. This paper feeder with the large wrap angle

has been very successful in many generations of Xerox copiers and printers.

Supplementary Case Study: Jet Engines. A similar approach was used to improve

the reliability of axial-flow fans in jet engines. A fan is a component of modern

high by-pass commercial jet engines that provides a significant increase in the

total mass flow, and therefore improvement in propulsive efficiency. A critical

failure mode of such fans is flutter vibration due to the length of the blades and

their exposure to inlet flow distortions. It had long been known that increasing

the chord of a fan blade stiffened the blade and thereby reduced the incidence of

the failure mode of flutter, but the chord of the blade was limited by constraints

on weight [Koff, 2004]. Eventually, new technologies for manufacturing hollow

blades enabled engine manufacturers to increase chords significantly without

added weight. For example, both Patent #4,345,877 [Monroe, 1980] and Patent

#4,720,244 [Kluppel and Monroe, 1987] contributed to these advances. Wide-

chord fans provided much greater resistance to flutter and have thereby greatly

improved engine reliability. As in the case of wrap angles in paper feeders,

innovation enabled a critical parameter to be pushed past its previous constraints

to move a one-sided failure-mode boundary and increase the operating window.

Summary of the Strategy. When a system variable only affects one of the one-sided

failure modes, take its value to its constraint limit. If the operating window

is still not large enough, seek new architectures or technologies that relax the

constraint.

4.2. Use Physics of Incipient Failure To Avoid Failure

In some systems the physics of the incipient failure can be used to prevent or

delay the failure mode. All one-sided failure modes are associated with underlying

physical phenomena. In many cases the failure mode exhibits distinct physical

mechanisms that become active as the onset of the failure mode is approached.

In some systems there exists an opportunity to exploit the physics of incipient

failure to increase the size of the operating window.

Primary Case Study: Jet Engines. An example is afforded by the use of shaped

grooves in compressor casings in modern jet engines. An axial flow compressor is

comprised of multiple alternating stages of rotor assemblies and stators. To limit

engine complexity and weight, a large pressure rise per stage is desired so that

the desired pressure rise in the compressor can be accomplished with a small

number of stages. However, the pressure increase of each stage is limited by a

failure mode of aerodynamic stall and surge. A stall involves separation of airflow

from a blade, which at any given time may affect only one stage or even a group

of stages.


A compressor surge generally refers to a complete flow breakdown throughout

the compressor. The value of airflow and pressure ratio at which a surge occurs is

termed the "surge point," and "surge margin" is a term for the difference between

the airflow and compression ratio at which it will normally be operated and the

airflow and compression ratio at which a surge will occur. Thus, we can readily

interpret surge margin as the distance from the one-sided failure mode of compressor surge.

In the late 1970s new technologies known as "casing treatments" were

developed. In one casing treatment technology assigned to Rolls-Royce, Patent

#4,086,022 [Freeman and Moritz, 1978], a series of angled channels are placed in

the casing of the compressor extending from the leading edge of the rotors and

extending just aft of the trailing edge (see Fig. 7). If a surge begins to occur, then

a rotating annulus of pressurized gas will begin to build up about the tips of the

blades. Because of the geometry of the slots, the annulus of air will be directed

into the slots thus reducing or eliminating the surge [Freeman and Moritz,

1978, p. 5].


To understand how the casing treatments are related to the operating window, it

is useful to consider Figure 8 adapted from Cumpsty [1997]. The abscissa in the

figure is mass flow of air into the engine. The mass flow in an engine may vary due

to changes in inlet conditions caused by atmospheric conditions or aircraft

maneuvers; therefore, mass flow is a noise factor as defined in Section 3.

The ordinate in Figure 8 is pressure rise across a stage of the compressor. When

conditions are at their nominal state, the engine will generally remain on the

operating line with mass flow and pressure rise both changing as a function of the throttle position set by the pilot. At a fixed throttle position, when mass flow is

reduced due to maneuvers or environmental conditions, the state of the engine

moves toward the surge line as indicated in step 1 of Figure 8. This pushes the

engine off the operating line and toward the failure-mode boundary. The amount

of mass-flow drop that can be tolerated before failure (step 3a or step 3b) is

sometimes called the "surge margin," which we interpret as an indication of the

operating window size. The technology described in Patent #4,086,022 can be

viewed as a means to exploit the incipient failure-mode physics (the rotating

annulus of air, step 2) to increase the surge margin. The treatments are designed

so that the incipient physics will lead to a pressure relief across the stage (step

3b). The advanced casing treatment increased fan stall margin by a staggering


20% under distorted inlet flow and with little loss in efficiency [Koff, 2004, p.

582].

Supplementary Case Study: Paper Feeder. A similar approach was used to

improve the reliability of paper feeders. For friction-retard paper feeders, the stack force between the feedbelt and the paper stack is a critical system variable.

If it is too large the multifeed rate will be excessive. If the stack force is too small,

the misfeed rate will be excessive. Therefore, there is an operating window

between these two one-sided failure modes (Fig. 9).

When the range of papers is moderate, it is easy to develop a sufficient operating

window so that both the multifeed rate and the misfeed rate are very small.

However, for the large range of papers that are typically used in large production

copiers and printers, it is very difficult, or impossible, to develop a sufficient

operating window, as shown on the left of Figure 9. On the left hand side of

Figure 9, it is evident that no single value of stack force will simultaneously avoid

both multifeeds and misfeeds over the full range of paper weights. This was still

true after robust parameter design had been completed, so there was little hope

to improve it further beyond the great improvement that had already been

achieved.
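
The situation on the left of Figure 9 can be sketched as an interval intersection. The example below assumes that, for each paper type, the acceptable stack force lies between a lower limit (below which misfeeds occur) and an upper limit (above which multifeeds occur). The function name `common_window` and all numeric values are hypothetical and are not taken from the source.

```python
from typing import Optional, Sequence, Tuple

# (lowest force that avoids misfeeds, highest force that avoids multifeeds)
Interval = Tuple[float, float]


def common_window(limits: Sequence[Interval]) -> Optional[Interval]:
    """Intersect the acceptable stack-force intervals of all paper types.

    Returns the range of stack force that works for every paper type,
    or None if no single stack force avoids both failure modes.
    """
    lower = max(low for low, _ in limits)    # tightest misfeed boundary
    upper = min(high for _, high in limits)  # tightest multifeed boundary
    return (lower, upper) if lower <= upper else None


if __name__ == "__main__":
    # Hypothetical limits in arbitrary force units.
    moderate_range = [(0.2, 0.6), (0.4, 0.9)]
    wide_range = [(0.2, 0.6), (0.7, 1.2)]  # light paper, heavy paper
    print(common_window(moderate_range))  # a usable window exists
    print(common_window(wide_range))      # no single force works
```

When the intersection is empty, as for the wide range of papers in this made-up data, no choice of a single stack force can succeed, which is the condition that motivates a concept-level change.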

The problem was resolved through the development of a stack force

relief/enhancement technology, U.S. Patent # 4,561,644 [Clausing, 1985]. This


technology uses two different values of the stack force, a small value for most

papers, and a larger value for heavy papers (as depicted on the right side of Fig.

9). Under normal conditions, the stack force is set to the small value. For most

common paper weights this works very reliably. If a larger paper weight is used, a

misfeed condition may begin to emerge. A sensor near the retard roll is designed

to sense the arrival of the lead edge of the sheet. If an incipient misfeed occurs,

the paper will not arrive within the desired time period. Under this condition, the

stack force is increased to the large value. This was done by energizing the

solenoid 90 in Figure 5, which pushed the feeder around the pivot 11, thus

increasing the stack force. Thus, the machine was able to reliably feed the full

range of paper weights.
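
The switching logic described above can be sketched in a few lines. This is a minimal illustration of the idea, not the control scheme of the patent: the function name `select_stack_force`, the timing values, and the force values are all hypothetical.

```python
from typing import Optional


def select_stack_force(arrival_time_s: Optional[float],
                       deadline_s: float,
                       low_force: float,
                       high_force: float) -> float:
    """Choose the stack force from the lead-edge sensor reading.

    `arrival_time_s` is the time at which the sensor saw the lead edge of
    the sheet, or None if the sheet has not arrived. A missing or late
    arrival is treated as an incipient misfeed, so the larger force is used.
    """
    sheet_is_late = arrival_time_s is None or arrival_time_s > deadline_s
    return high_force if sheet_is_late else low_force


if __name__ == "__main__":
    # Hypothetical values: 0.10 s deadline, forces in arbitrary units.
    print(select_stack_force(0.05, 0.10, low_force=0.3, high_force=1.0))
    print(select_stack_force(None, 0.10, low_force=0.3, high_force=1.0))
```

The point of the design is that the signal used for switching is itself a symptom of the approaching failure, so no prior knowledge of the paper weight is needed.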

Summary of the Strategy. Exploit the physical mechanisms associated with an

incipient failure to offset the failure mode, thereby increasing the size of the operating window.

4.3. Employ Two Different Operating Modes

In some cases, the development process reaches a state in which the system has a

limited operating window between multiple one-sided failure modes and


therefore cannot operate reliably. In such cases, it is often advisable to change

from a single operating mode to two operating modes. Separately designing two

distinct operating modes enables significant design freedom to seek better

resistance to the failure modes. This strategy is often similar to the strategy "use

physics of incipient failure to avoid failure," and in fact the two strategies can

overlap. However, two key distinctions should be made: (1) Incipient failure-mode

physics do not always lead to clearly distinct operating modes, and (2) the switch

between two modes need not be cued by incipient failure physics and can instead

be cued by operator inputs or state variables of the system.

Primary Case Study: Paper Feeder. A failure mode of friction-retard paper

feeders (Fig. 5) is excessive wear of the retard roll. In previous designs the roll had

been rotated approximately once per hour to distribute the wear over the entire roll. Nevertheless, the wear was excessive, and was a considerable expense in

service cost and lost production of the copier/printer. The critical variable that

determines the wear of the retard roll is the force between the feedbelt and the

retard roll, F, multiplied by the contact distance D between the feedbelt and the

retard roll. The product, FD, is the work that the retard roll can do to remove

energy from the second sheet, and thus stop the second sheet. However, this is

also the work that causes wear of the retard roll.

The result is as shown in Figure 10. With the previous design, one system variable

FD has control of both of the one-sided failure modes, excessive multifeeds and

excessive wear of the retard roll. Maurice Holmes at Xerox recognized that this

problem could be resolved through a redesign of the retard mechanism by adding


a second operating mode. The innovation was included in the advanced paper

feeder that first went into production in the Xerox 1075 copier in 1981, Patent #

4,475,732 [Clausing et al., 1984].

The inventive process that led to this invention is well described in terms of the theory of inventive problem solving (TRIZ). The TRIZ process generally begins by

framing the current problem as a conflict. In this case, there was an engineering

conflict between avoiding multifeeds and avoiding excessive wear. In TRIZ, one

effective way to seek a conflict resolution is through Sufield or substance-field

analysis [Clausing and Fey, 2004]. Simple Sufield diagrams are in the form of a

triad. The relevant triad diagram for the retard-roll problem is shown in the left

hand side of Figure 11. Here substances are (1) the paper and (2) the roll/shaft.

The field is the contact force. TRIZ includes many standards for the creative revision of the Sufield. One of the standards is: "To enhance the effectiveness of

the Sufield, transform one substance into an independently controlled Sufield,

thus generating a chain Sufield" (p. 112). This can be implemented by introducing a

field between the retard roll and its shaft (as shown in right hand side of Fig. 11).

This is as far as Sufield analysis will take us. Now we have to use science and art to

identify a field and a component for creating the field that will open an operating

window. One such approach is to insert a friction brake with a brake torque T into the design to produce a field between the retard roll and its shaft (U.S. Patent

4,475,732). This field creates the possibility of two distinct operating modes: (1)

When the torque that is applied to the roll is less than T, the roll remains

stationary, and (2) when the torque that is applied to the roll is greater than T, the

roll rotates.

The torque that is applied to the retard roll is produced by the friction from the

belt or the paper, whichever is contacting the roll. When one sheet of paper is

between the roll and the feedbelt, the friction coefficient has a value of 2, which

overcomes the brake torque. Therefore, the roll rotates, and there is not any

wear. When two sheets of paper are between the roll and the feedbelt, the

friction coefficient is 0.6, and the brake torque prevents rotation of the retard

roll. Thus the second sheet is stopped.
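
The two operating modes can be sketched with a simple friction model. The example assumes that the torque applied to the roll is the friction coefficient times the normal force times the roll radius (a Coulomb-friction assumption made here for illustration). The friction coefficients 2 and 0.6 come from the text above; the function names, the unit normal force, and the unit radius are hypothetical.

```python
from typing import Tuple


def roll_rotates(friction_coefficient: float, normal_force: float,
                 roll_radius: float, brake_torque: float) -> bool:
    """Return True if the applied friction torque overcomes the brake."""
    applied_torque = friction_coefficient * normal_force * roll_radius
    return applied_torque > brake_torque


def brake_torque_window(mu_one_sheet: float, mu_two_sheets: float,
                        normal_force: float,
                        roll_radius: float) -> Tuple[float, float]:
    """Range of brake torque T that gives both desired behaviors.

    T must exceed the torque applied when two sheets are present (so the
    roll stays stationary and stops the second sheet) and stay below the
    torque applied when one sheet is present (so the roll rotates).
    """
    lower = mu_two_sheets * normal_force * roll_radius
    upper = mu_one_sheet * normal_force * roll_radius
    return lower, upper


if __name__ == "__main__":
    force, radius = 1.0, 1.0  # hypothetical units
    print(brake_torque_window(2.0, 0.6, force, radius))
    brake = 1.0  # any value inside the window
    print(roll_rotates(2.0, force, radius, brake))  # one sheet
    print(roll_rotates(0.6, force, radius, brake))  # two sheets
```

In this simplified picture the brake torque has its own window, bounded by the two friction conditions, and any setting inside it produces the automatic switch between modes.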


The addition of the new operating mode created an additional design parameter,

brake torque, which sets the condition for the switch between the two modes.

Thus, the design space expands from a 1-D operating window to a 2-D operating

window (Fig. 12). If the brake torque is set to an appropriate value, the retard roll will only rub against the paper when the incipient multifeed condition actually

occurs. In this case, the excessive-wear failure-mode boundary is never active and

a new failure mode (paper damage) becomes the limiting factor on parameter FD,

leaving a greatly increased operating window.

Supplementary Case Study: Jet Engines. A similar approach was used to

simultaneously avoid two one-sided failure modes associated with combustion in

jet engines. A combustor is a part of a jet engine in which fuel is injected into the

air stream, mixed with air, and burned. Two key failure modes of a combustor are

concerned with the composition of the exhaust gas, which is tightly regulated to

protect the environment. One failure mode is excessive production of carbon

monoxide (CO), which occurs with an overly lean mixture and low temperature in

the combustion zone. Another failure mode is excessive production of oxides of

nitrogen (NOX), which is associated with overly high temperature in the

combustion zone. Given the changes in the thrust demands (and many other

parameters that vary), it is a challenge to maintain the combustion conditions in the small operating window between the failure modes. In the 1970s a new

technology called two-zone or staged combustion substantially increased the

operating window by affording multiple operating modes [Markowski, Lohmann,

and Reilly, 1976; Lefebvre, 1999]. When the demand for thrust is low, all the

combustion takes place in a single primary zone. When thrust demands are


highest, the engine automatically switches to a mode in which combustion occurs

in two different zones each of which is functioning within the operating window

between the CO and NOX related failure modes. This technology has been

developed through many inventions including Patent #4,052,844 [Caruel,

Quillevere, and Gastebois, 1977] and has become popular especially in gas

turbine engines for ground based power [Washam, 1983]. As in the case of the

paper feeders with a friction brake, the system automatically switches between

two modes of operation in order to increase the operating window between two

coupled one-sided failure-mode boundaries.

Summary of Strategy. When it is not possible to simultaneously avoid two one-sided

failure modes due to a wide range of noise values, consider defining two

distinct operating modes so that at least one of the failure modes will be moved to increase the size of the operating window.

4.4. Identify and Exploit Dependencies among Failure Modes

In the operating-window approach, the parameter space is sketched out and the

failure mode boundaries are identified. In the sketch, it is often the case that the

parameters associated with the axes are not independent. A small change

induced in one parameter will have an associated effect on the other one. It

seems clear that such dependencies can influence system reliability. What is

sometimes overlooked is that they often provide an opportunity to use the

dependence to stay within the operating window.


Primary Case Study: Jet Engines. An example is afforded by turbine blade cooling

systems [Sidwell, 2004]. The physical layout of the system is described in Figure

13. Air from the compressor is routed to the first-stage turbine blades. The

cooling flow path includes a Tangential On-Board Injector, which brings the flow

from a supply at Ps into the rotating parts of the engine. The area between the rotating seal and the blades acts as a plenum storing compressed gas at a

pressure Pp. The gas then flows through each of the many first stage blades. The

purpose of this flow is to cool the surface of the blades and thereby avoid the

failure mode of early blade oxidation.

To apply operating-window methods to this scenario, one may first sketch the

parameter space and the failure-mode boundaries. Figure 14 depicts a highly

simplified window with just two failure modes, oxidation of blade #1 and

oxidation of blade #2. Manufacturing variation may excite failure mode #1

(oxidation of blade #1) if its flow passages are constricted causing m1 to drop.

However, the schematic diagram of Figure 14 suggests that there is a dependency

among the failure modes. Any small drop in m1 tends to cause a rise in plenum

pressure and a resulting rise in m2. The reverse is also true: any small drop in m2


tends to cause a rise in plenum pressure and a resulting rise in m1.

This interdependency of the failure modes creates an opportunity to create larger

distance from both failure modes. Turbine blades are routinely tested for their

flow characteristics. Sidwell proposed that this test could be used to sort the

blades into low flow, medium flow, and high-flow classes. In this way, a second

interdependency is added to the system. The low m1 due to the sorting process

brings about a low m2. The nature of the interdependency caused by the plenum

causes the two effects to cancel (or very nearly cancel) as depicted in Figure 14.

Sidwell [2004] estimated that binning turbine blades will increase the life of the

high flow and medium flow blades by 50% or more and would enable low-flowing

blades to be used with approximately the same life as current engines.
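
The plenum interdependency and the effect of sorting can be illustrated with a toy flow network. The sketch assumes that every flow is linear in its pressure difference, which is a strong simplification of the real compressible flow and is not Sidwell's model. The function name `plenum_flows`, the conductance values, and the pressures are all hypothetical.

```python
from typing import List, Sequence, Tuple


def plenum_flows(blade_conductances: Sequence[float],
                 supply_conductance: float,
                 supply_pressure: float,
                 exit_pressure: float) -> Tuple[float, List[float]]:
    """Solve a linear toy model of blades fed from a shared plenum.

    Supply flow: c_s * (Ps - Pp). Flow through blade i: c_i * (Pp - Pe).
    Mass conservation (supply flow equals the sum of blade flows) gives
    Pp = (c_s * Ps + C * Pe) / (c_s + C), where C is the sum of the c_i.
    Returns the plenum pressure Pp and the list of blade flows.
    """
    total = sum(blade_conductances)
    plenum_pressure = ((supply_conductance * supply_pressure
                        + total * exit_pressure)
                       / (supply_conductance + total))
    flows = [c * (plenum_pressure - exit_pressure)
             for c in blade_conductances]
    return plenum_pressure, flows


if __name__ == "__main__":
    # Hypothetical values: ten blades, nominal conductance 1.0, low-flow 0.8.
    args = dict(supply_conductance=10.0, supply_pressure=10.0,
                exit_pressure=0.0)
    _, nominal = plenum_flows([1.0] * 10, **args)
    _, mixed = plenum_flows([0.8] + [1.0] * 9, **args)
    _, binned = plenum_flows([0.8] * 10, **args)
    print("nominal blade in nominal set:", nominal[0])
    print("low-flow blade in mixed set: ", mixed[0])
    print("low-flow blade in binned set:", binned[0])
```

In this toy model a constricted blade raises the plenum pressure and so increases the flow through its neighbors. A low-flow blade also receives more cooling flow when it is grouped with other low-flow blades than when it is mixed with nominal ones, because the shared plenum pressure is higher in the sorted set.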

Supplementary Case Study: Paper Feeder. In a document feeder for a copier it is highly desirable to feed from the bottom of the stack of documents. This leaves

the top of the stack free to receive the recirculated document after it has been

copied. The most advanced document-feeder technology uses air to move the

document, which minimizes damage to the document. Such feeders typically use

a combination of positive air pressure and negative air pressure (vacuum). The

positive air pressure is used to levitate the document stack (otherwise the weight

of the document stack would tend to cause both misfeeds and multifeeds).

Therefore, a sufficient pressure under the stack is required to avoid both misfeeds

and multifeeds. However, excessive pressure under the stack could cause the last

sheet to blow away. Therefore, good system design requires an operating window

between inadequate pressure and excessive pressure, as shown in Figure 15.


5. SUMMARY

Reliability is one of the most important characteristics of an engineering system.

Probabilistic formulations of reliability are useful for component selection,

verification testing, and field-service management. However, at the early stages of system architecting and concept design, probabilistic formulations are not as

helpful. We propose that thinking in terms of physical mechanisms of failure is

much more effective and that the fundamental principle of reliability engineering

is failure-mode avoidance.

A useful reliability-engineering concept is the operating window, which is the

region in noise parameter space that avoids failure modes. In this paper we have

given a mathematical definition of the operating window. We have shown that

adding to the window increases the reliability regardless of the probability

distributions of the noise factors. To this we add the principle that this should be

done early and rapidly during the system development. In particular, concept

design changes frequently add large regions to the operating window and account

for some of the largest improvements to reliability of systems over the course of

their development.
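
The claim that enlarging the window improves reliability for any noise distribution follows from a one-line argument. The notation below is assumed for this sketch: $x$ is the vector of noise factors with probability density $f$, $W$ is the original operating window, $W'$ is the enlarged window with $W \subseteq W'$, and reliability $R$ is the probability that $x$ falls inside the window.

```latex
\[
  R' - R
  = \int_{W'} f(x)\,dx - \int_{W} f(x)\,dx
  = \int_{W' \setminus W} f(x)\,dx
  \;\ge\; 0 ,
\]
% The inequality holds because f(x) >= 0 everywhere, whatever the
% distribution of the noise factors.
```

The size of the gain depends on how much probability mass lies in the added region, but its sign does not depend on the distribution.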

To illustrate this approach, we have described four strategies for increasing

operating window through concept design. Each strategy is illustrated by two case

studies, one from the field of paper feeders for copiers and printers, and the

other from the field of jet engines. Each case study includes past inventions that

significantly improved reliability. By showing the theory and eight case studies we

have displayed both the fundamentals and the diversity of industrial applications

of this important approach to the development of reliable systems.


REFERENCES

S.S. Rao, Reliability-based design, McGraw-Hill, New York, 1992.

M. Pecht and A. Dasgupta, Physics of failure: an approach to reliable product development, J Inst Environ Sci, 1995.

G. Taguchi, Taguchi on robust technology development, ASME Press, New York, 1993.

Internet database
