Practical System Reliability

By

Nicoline Reynecke 920210605

A research paper submitted for the subject Reliability Management

Faculty of Engineering and the Built Environment of the University of Johannesburg

November 2013

Table of Contents

Introduction
Practical system reliability
Availability
How does a high availability (HA) system work?
Downtime budget
Quality engineering
Principles of reliability
How to predict system reliability?
Hazard rate
Reliability of systems
Series systems
Parallel systems
Fault tree analysis (FTA)
Failure mode and effects analysis (FMEA)
Maintenance Strategy
Reactive maintenance
Preventive maintenance
Predictive maintenance
Proactive Maintenance
Reliability centered maintenance (RCM)
Conclusion
Bibliography

Introduction

In today's fast-changing world, an organization is only as good as the service it provides. The service an organization can provide to the public hinges on the reliability of its systems and its products. Reliability and maintainability are among the most important aspects a company can invest in. If properly designed for, the reliability of a product will secure client goodwill as well as a good name for the company, a name that can be trusted. Certain tools and methods are available to organizations to evaluate and improve the reliability of their systems as well as the reliability of the products they provide to the public. Some of those tools are fault tree analysis (FTA), failure mode and effects analysis (FMEA) and maintenance strategies.

Practical system reliability

The rise of the internet, sophisticated computing and communication technologies, and globalization have raised customers' expectations of powerful, always-on services [1]. This makes system reliability very important today. What is reliability engineering? Reliability engineering emphasizes dependability in the lifecycle management of a product [2]. If customers cannot get what they want in the time frame they need it, another service provider is just a click away. Because of this, highly available services are vital in any organization.

Availability

The costs associated with poor service availability or reliability are:

- Loss of brand reputation and customer goodwill
- Direct loss of customers and business
- Higher maintenance-related operating expenses
- Financial penalties or liquidated damages

Wikipedia states that availability is the degree to which a system, subsystem or equipment is in a specified operable and committable state at the start of a mission, when the mission is called for at an unknown, i.e. a random, time. Simply put, availability is the proportion of time a system is in a functioning condition [3]. To calculate availability mathematically, the following formula is used:

Availability = Uptime / (Uptime + Downtime) = MTTF / (MTTF + MTTR)

where MTTF is the mean time to failure and MTTR is the mean time to repair.

High-availability systems are designed to detect, isolate, alarm and recover from the failures that will inevitably happen, because no system lasts forever. To ensure this high availability, the system has redundant elements that can be switched to when there is a failure, so that no single failure can result in a loss of service. This high-availability design principle is known as no single point of failure [1].

How does a high availability (HA) system work?

An HA system will have a suite of failure detectors, which typically contain both hardware and software mechanisms.
When these detectors are triggered, the system isolates the failure to a specific point, be it software or hardware related, and then activates the suitable recovery scheme. If the suitable recovery does not take effect, a secondary recovery is triggered. Ultimately a human operator is responsible for any system, and if the system does not recover fast enough or successfully, the operator will step in and perform a manual recovery. Failures fall into two broad categories, namely sub-acute failures and acute failures. Sub-acute failures usually do not impact the system's performance suddenly and can be corrected as soon as the operator can attend to them. Acute failures suddenly and profoundly impact the service of a system and must be corrected immediately to ensure no loss of service. HA systems must be able to detect both failure types and trigger the proper recovery action so that the impact of failures on the system remains small. Both failure types cost the company not only money but also man-hours and repair time.

Downtime budget

A company must create a downtime budget to deal with these types of failures. The downtime budget can be created and managed to ensure that there is a plan of action for any type of failure.
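As a rough illustration of how a downtime budget relates to the availability formula given earlier, the sketch below converts an availability figure into an annual downtime allowance. The function names and the example MTTF and MTTR values are hypothetical and chosen only for illustration; they do not come from the cited sources.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap years


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_minutes(avail: float) -> float:
    """Downtime allowance per year implied by an availability figure."""
    return (1.0 - avail) * MINUTES_PER_YEAR


# Hypothetical system: fails on average every 10,000 h and takes 1 h to repair.
a = availability(mttf_hours=10_000, mttr_hours=1)
print(f"availability    = {a:.6f}")
print(f"downtime budget = {annual_downtime_minutes(a):.1f} min/year")
```

Working backwards is equally useful: starting from a target availability, `annual_downtime_minutes` gives the total number of minutes that can then be divided among the parties and failure types the budget has to cover.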
The factors that contribute to downtime are divided into three categories, based on the party responsible [1]:

- Customer attributable (due to the actions of the customer)
    - Procedural
    - Power failure, battery or generator
    - Internal environment
    - Traffic overload
    - Planned event
- Product attributable (due to the design and implementation of the product)
    - Hardware failure
    - Design, hardware
    - Design, software
    - Procedural
    - Planned event
- Third party attributable (due to the actions of others)
    - Facility related
    - Power failure, commercial
    - External environment

In the article "Single points of failure within systems of systems" the author mentions that, throughout the research, the element that repeatedly showed vulnerability was data. If a system cannot create, store or transmit data, then that system as a whole can fail, and the data becomes a single point of failure. Data has a life cycle; it can be created, stored, used, shared, archived and then destroyed [4]. Today, communication networks and the internet form complex and dynamic systems of systems (SoS). A system or network must have multiple points of access, and people can attack those points with little knowledge or skill. Data is at risk not only from outside attack but also from the everyday user, the components within the SoS and the physical structure of the given network or internet. Any of these risks can become a single point of failure, and companies need to have a system in place for when this happens. For each safeguard put in place against these failures the benefits increase, but so do the costs. If companies are aware of these points of failure, downtime budgets can be put in place effectively beforehand.

Quality engineering

Quality is an important factor in system reliability. But what is quality? Quality is the totality of features and characteristics of a product, process or service that bear on its ability to satisfy stated or implied needs [5].
David Garvin defined the concept of eight dimensions of quality. Some of the dimensions are mutually reinforcing, whereas others are not: improvement in one may come at the expense of others. Understanding the trade-offs desired by customers among these dimensions can help build a competitive advantage [6] [7]. The eight dimensions are:

Figure 1. Source: BSI Education, The concept of quality

Table 1

| Dimension | Description | Example |
|---|---|---|
| Performance | How efficiently a product achieves its intended purpose | Does the e-reader let you read electronic books and magazines with ease? |
| Features | The elements that supplement the product's basic performance | The reflow function on an e-reader |
| Conformance | Does the product meet the specifications for its use? | Does the e-reader meet the specification of being able to show electronic formats? |
| Serviceability | The ease with which a product can be repaired | If the e-reader breaks, is it easy to repair? |
| Perceived quality | This is based on the customer's views and opinions | How the customer sees the e-reader |
| Aesthetics | How the product influences the customer's senses | Is the e-reader easy to hold, small and travel size, or big enough to read comfortably on? |
| Durability | How much stress the product can withstand without failure | If the e-reader falls, will it stay in one piece? |
| Reliability | How consistently the product performs over its life cycle | Will the e-reader still work five years from now as well as on the day the customer bought it? |

By applying statistical analysis to the product's characteristics, its quality can be determined. The analysis requires the mean, standard deviation, probability and probability density function of each characteristic to be calculated.

Principles of reliability

What is reliability? Reliability is typically defined as the ability to perform a specified or required function under specified conditions for a stated period of time [1]. There is a relationship between the quality and the reliability of a product. The reliability of a product is its ability to retain its quality as time progresses [8]. Reliability (R) and unreliability (F) vary with time. The reliability of a product decreases with time, while its unreliability increases with time. The events of reliability and unreliability are complementary, so their sum must equal 1:

R(t) + F(t) = 1

To obtain measures of the reliability of a product, consider non-repairable items and repairable items.

Table 2. Non-repairable items and repairable items

Mean time to fail = total up time / number of failures

Mean down time = total down time / number of failures

Mean failure rate = number of failures / total up time
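The three measures in Table 2 can be computed directly from an operating record. The following is a minimal sketch; the `OperatingRecord` class, its field names and the sample figures are hypothetical and serve only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class OperatingRecord:
    total_up_time_h: float    # cumulative hours in service
    total_down_time_h: float  # cumulative hours out of service
    failures: int             # number of failures observed

    def __post_init__(self) -> None:
        # All three measures divide by a count or a duration, so guard them.
        if self.failures <= 0 or self.total_up_time_h <= 0:
            raise ValueError("need at least one failure and some up time")


def mean_time_to_fail(rec: OperatingRecord) -> float:
    return rec.total_up_time_h / rec.failures


def mean_down_time(rec: OperatingRecord) -> float:
    return rec.total_down_time_h / rec.failures


def mean_failure_rate(rec: OperatingRecord) -> float:
    return rec.failures / rec.total_up_time_h


# Hypothetical year of operation: 8700 h up, 60 h down, 4 failures.
record = OperatingRecord(total_up_time_h=8700.0, total_down_time_h=60.0, failures=4)
print("mean time to fail :", mean_time_to_fail(record), "h")
print("mean down time    :", mean_down_time(record), "h")
print("mean failure rate :", mean_failure_rate(record), "failures/h")
```

Note that the mean failure rate is simply the reciprocal of the mean time to fail, which is why only two of the three quantities need to be recorded.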

How to predict system reliability?

Some of the methods used to predict the reliability of a system are fault tree analysis, network analysis and Monte Carlo simulation. All of these methods evaluate the probability of component failure in a system.
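To give a feel for the Monte Carlo approach, the sketch below estimates the reliability of a small, hypothetical three-component system by random sampling. The system structure (component A is required, while B and C back each other up), the component reliabilities and the assumption that components fail independently are all illustrative choices, not taken from the cited sources.

```python
import random


def system_works(a_ok: bool, b_ok: bool, c_ok: bool) -> bool:
    """Hypothetical structure: A is required; B and C back each other up."""
    return a_ok and (b_ok or c_ok)


def estimate_reliability(p_a: float, p_b: float, p_c: float,
                         trials: int = 100_000, seed: int = 1) -> float:
    """Fraction of random trials in which the system works.

    Each component works with its own probability, independently of the
    others (an assumption of this sketch).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    successes = 0
    for _ in range(trials):
        if system_works(rng.random() < p_a,
                        rng.random() < p_b,
                        rng.random() < p_c):
            successes += 1
    return successes / trials


estimate = estimate_reliability(0.95, 0.90, 0.90)
exact = 0.95 * (1 - (1 - 0.90) * (1 - 0.90))  # closed form for this structure
print(f"simulated: {estimate:.4f}   exact: {exact:.4f}")
```

For a system this small the closed-form answer is available, which makes it a convenient check on the simulation; the sampling approach becomes worthwhile when the structure is too complex to evaluate by hand.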

Hazard rate

Another important measure of a product's quality and reliability is the hazard rate function, or failure rate, denoted by λ(t). The bathtub curve shows the most general form of the failure rate and consists of three distinct phases, namely early failure, useful life and wear-out failure.
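One common way to illustrate the three phases numerically is the Weibull hazard function, whose shape parameter determines whether the failure rate falls, stays constant or rises with time. The Weibull model and the parameter values below are assumptions made for this sketch; the sources cited here do not prescribe a particular distribution.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate: (shape / scale) * (t / scale) ** (shape - 1), t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Illustrative shape parameters for the three phases of the bathtub curve.
phases = {
    "early failure": 0.5,     # shape < 1: failure rate decreases with time
    "useful life": 1.0,       # shape = 1: failure rate is constant
    "wear-out failure": 3.0,  # shape > 1: failure rate increases with time
}

for name, shape in phases.items():
    rates = [weibull_hazard(t, shape, scale=1000.0) for t in (100.0, 500.0, 900.0)]
    formatted = ", ".join(f"{r:.5f}" for r in rates)
    print(f"{name:17s} hazard at t = 100, 500, 900 h: {formatted}")
```

A full bathtub curve can be approximated by adding one hazard of each kind, since the early-failure term dominates at small t and the wear-out term dominates at large t.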

Figure 2. Source: Practical system reliability by E. Bauer, X. Zhang and D.A. Kimber

In the early failure phase, the failure rate decreases with time. When the product is a new design, early failures can occur because of design faults, poor-quality components, manufacturing faults, installation errors, and operating and maintenance errors. The hazard rate decreases as time moves on because design faults may be corrected, weak components are replaced with better ones, and the user becomes more familiar with the installation of the product. In the next phase, the useful life of the product, the failure rate is constant and low, as indicated on the sketch. In this phase all the weak components have been replaced and the design, manufacture, installation, operation and maintenance errors have been corrected. In the last phase, known as the wear-out failure phase, the failure rate increases with time. The increase is due to individual components reaching the end of their expected design life; in other words, the product is wearing out [8].

Reliability of systems

A system is a set of interacting or interdependent components forming an integrated whole, or a set of elements (often called components) and relationships which are different from relationships of the set or its elements to other elements or sets [9]. The reliability of a system therefore depends on the smaller elements or components that make up the system. The configuration of these components plays a big role in the system's performance. A recent paper argued that the performance criteria of manufacturing systems, such as reliability, productivity and quality, are determined by different configurations. Two fundamental configurations for a system's components are the series design and the parallel design. These two configurations form the basis of the reliability modeling and analysis of more complex configurations [10].

Series systems

$$R_{SYST} = R_1 R_2 \cdots R_i \cdots R_m$$

Reliability of a series system

Figure 3. Series system (elements connected in series between A and B)

This type of system will fail if any one of its elements fails. The system's reliability is therefore equal to the product of the individual element reliabilities, and the failure rate of the system is the sum of the individual element failure rates.

Parallel systems

The system will still be able to function provided that any one of its components still functions. A system that is in parallel has active redundancy. Active redundancy is a design concept that increases operational availability and reduces operating cost by automating most critical maintenance actions [11]. The overall system unreliability is the product of the individual element unreliabilities.

$$F_{SYST} = F_1 F_2 \cdots F_j \cdots F_n$$

Unreliability of a parallel system

Figure 4. Parallel system (elements connected in parallel between A and B)
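The series and parallel formulas can be compared side by side with a short calculation. This is a minimal sketch that assumes the elements fail independently; the function names and the example reliabilities are hypothetical.

```python
from math import prod


def series_reliability(reliabilities):
    """Series system: R_SYST is the product of the element reliabilities."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """Parallel system: F_SYST is the product of the element unreliabilities."""
    system_unreliability = prod(1.0 - r for r in reliabilities)
    return 1.0 - system_unreliability  # R(t) + F(t) = 1


elements = [0.9, 0.9, 0.9]  # three hypothetical elements, each 90% reliable
print(f"series  : {series_reliability(elements):.3f}")
print(f"parallel: {parallel_reliability(elements):.3f}")
```

The comparison shows why redundancy matters: every element added in series lowers the system reliability, while every element added in parallel raises it.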

Fault tree analysis (FTA)

The fault tree is an established form of analysis and is built from the top down, using logical AND/OR gates to combine the causal events [12]. Computer programs are available to calculate the top-event probabilities of a fault tree. The top event of a fault tree must be chosen well to ensure that the analysis is neither too wide nor too narrow to produce the results that are necessary. More than one fault tree analysis can be done on a system, as each fault tree represents only one of the many possible types of failure in a system. FTA has been used in many industries, such as the air and space, chemical, electrical and transport industries. In a case study on radio-based railroad crossings, the authors concluded that formal FTA is a promising, if not always easy, topic. Because FTA is human-readable and understandable, with a logical background structure, industry will accept this method easily [13].

The following table and figure show the common gates used in a fault tree analysis, as well as an example of a fault tree analysis done on a press unit at a paper mill.

Table 3. Source: Reliability and risk assessment by J.D. Andrews and T.R. Moss

| No. | Gate name | Causal relation |
|---|---|---|
| 1 | AND gate | Output event occurs if all input events occur simultaneously |
| 2 | OR gate | Output event occurs if at least one input event occurs |
| 3 | m-out-of-n gate | Output event occurs if m out of n input events occur |
| 4 | Exclusive OR gate | Output event occurs if one, but not both, of the two input events occurs |
| 5 | Inhibit gate | Input produces output when the input event and the conditional event occur |
| 6 | Priority AND gate | Output event occurs if all input events occur in order from left to right |
| 7 | NOT gate | Output event occurs if the input event does not occur |
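To show how the AND and OR gates combine probabilities from the bottom of a tree up to the top event, the sketch below evaluates a small, hypothetical fault tree. The tree, the event names and the probabilities are invented for illustration (they are not the press unit example), and the calculation assumes that basic events are independent and that each appears only once in the tree.

```python
from math import prod


def gate_probability(node, basic_events):
    """Probability of a node: a basic-event name or a (gate, children) pair."""
    if isinstance(node, str):
        return basic_events[node]
    gate, children = node
    probabilities = [gate_probability(child, basic_events) for child in children]
    if gate == "AND":
        # All inputs must occur: multiply the probabilities.
        return prod(probabilities)
    if gate == "OR":
        # At least one input occurs: complement of none occurring.
        return 1.0 - prod(1.0 - p for p in probabilities)
    raise ValueError(f"unsupported gate: {gate}")


# Hypothetical tree: cooling is lost if power is lost OR both pumps fail.
tree = ("OR", ["power_loss", ("AND", ["pump_a_fails", "pump_b_fails"])])
basic_events = {"power_loss": 0.01, "pump_a_fails": 0.1, "pump_b_fails": 0.1}

top = gate_probability(tree, basic_events)
print(f"top event probability: {top:.4f}")
```

Only the first two gates of Table 3 are handled here; trees with repeated basic events or the other gate types need a fuller method such as minimal cut set analysis.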

Figure 5. Example of an FTA for a press unit in a paper mill

Failure mode and effects analysis (FMEA)

Failure mode and effects analysis (FMEA) is the procedure by which each potential failure mode in a system is analyzed to determine the effect it has on the system, and is then classified according to its severity [12]. The following shows an example of an FMEA of a common household washing machine.

Table 4. Functional FMEA. Source: Burgehugheswalsh.co.uk

| Function | Functional failure mode | Potential effects of failure | Severity | Potential causes of failure | Occurrence | Current process controls: prevention | Current process controls: detection | Detection | RPN* (S x O x D) | Responsibility and target completion date | Action taken |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Load dirty clothes | No load | No wash | 3 | User error | 2 | None | Built-in test | 9 | 54 | John - completed | Include sensor for load detection |
| Load dirty clothes | Overload | Very poor wash | 5 | User error | 6 | Weigh load function | Built-in test | 9 | 270 | John - completed | Include sensor for load detection |
| Load dirty clothes | Underload | Poor wash | 4 | User error | 4 | Weigh load function | Built-in test | 9 | 144 | Mike - completed | Include sensor for load detection |
| Load dirty clothes | Hidden extreme mix of load | Colour run | 6 | Items covered by others | 9 | None | None | 9 | 486 | Mike - completed | Clear lid for visual detection |
| Load dirty clothes | Hidden extreme mix of load | Fabric shrink | 7 | Items covered by others | 9 | None | None | 9 | 567 | Jane - completed | None |
| Load dirty clothes | Unintended load: foreign object in load | Object damages items | 7 | User error | 3 | None | None | 10 | 210 | Jane - completed | None |
| Load dirty clothes | Unintended load: foreign object in load | Object damages machine | 8 | User error | 2 | None | None | 10 | 160 | Jane - completed | None |

*RPN is the risk priority number.
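The risk priority numbers in Table 4 are the product of the severity, occurrence and detection scores, and they are normally used to rank failure modes for attention. The sketch below reproduces that calculation from the table's scores; the `FailureMode` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    mode: str
    effect: str
    severity: int
    occurrence: int
    detection: int

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


# Severity, occurrence and detection scores as listed in Table 4.
modes = [
    FailureMode("No load", "No wash", 3, 2, 9),
    FailureMode("Overload", "Very poor wash", 5, 6, 9),
    FailureMode("Underload", "Poor wash", 4, 4, 9),
    FailureMode("Hidden extreme mix of load", "Colour run", 6, 9, 9),
    FailureMode("Hidden extreme mix of load", "Fabric shrink", 7, 9, 9),
    FailureMode("Foreign object in load", "Object damages items", 7, 3, 10),
    FailureMode("Foreign object in load", "Object damages machine", 8, 2, 10),
]

# Highest RPN first: these failure modes deserve attention before the others.
for fm in sorted(modes, key=lambda m: m.rpn, reverse=True):
    print(f"{fm.rpn:4d}  {fm.mode} -> {fm.effect}")
```

Ranked this way, the two hidden extreme mix of load failure modes come out on top, driven by their high occurrence and poor detectability rather than by severity alone.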

Maintenance Strategy

For a working reliability plan, maintenance must be taken into account. The approach to maintenance of a system is as follows:

Figure 6. Maintenance approaches: unplanned maintenance (reactive maintenance) and planned maintenance (preventive, predictive and proactive maintenance). Source: Reliability Strategy and Plan, www.utk.education

The benefits of a planned maintenance system are numerous and have progressive effects on a company. Some of the benefits include:

Table 5. Source: Reliability Strategy and Plan, www.utk.education

| Reduction in | Increase in |
|---|---|
| The size and scale of repairs | Accountability for all cash spent |
| Downtime | Equipment's useful life |
| Number of repairs | Operator, mechanic and public safety |
| Overtime | Consistency and quality of output |
| Maintenance costs | Equipment availability |
| Overall cost per product unit | Control over parts |

Reactive maintenance

Reactive maintenance is also known as breakdown or run-to-failure maintenance. This type of maintenance only takes place when it is absolutely necessary. Little expense or effort is allocated to this type of maintenance until it is required. Some examples of items maintained reactively are light bulbs and electronic circuit boards.

Preventive maintenance

Preventive maintenance is also known as time-based or interval-based maintenance. This type of maintenance is scheduled and carried out at set operating-time intervals to prolong the life of the equipment and to prevent equipment failure. Preventive maintenance does not take equipment condition into consideration, so it can be costly as well as ineffective. The tasks performed during preventive maintenance include cleaning, inspection, adjustment, lubrication and parts replacement. Some examples of preventive maintenance are car maintenance and machine tooling.

Predictive maintenance

Predictive maintenance is also known as condition-based maintenance. This type of maintenance relies on failure forecasting by analyzing the condition of the equipment. The analysis can be done by trending parameters such as vibration, temperature and flow. The maintenance is also scheduled so that it will not interfere with normal operation and production times. Predictive maintenance reduces costs and improves reliability. Some benefits of predictive maintenance are an improved mean time to repair and reduced inventory levels. The most commonly used predictive maintenance techniques include vibration monitoring, oil analysis, thermography, shock pulse measurement, and ultrasonic and x-ray scanning.
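The idea of forecasting a failure from a trended parameter can be sketched with a simple straight-line fit. The example below fits a least-squares line to vibration readings and estimates when the trend will reach an alarm limit. The readings, the 7.1 mm/s limit and the assumption that the trend is linear are all hypothetical choices made for illustration.

```python
def linear_trend(times, values):
    """Least-squares slope and intercept of values against times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def time_to_limit(times, values, limit):
    """Time at which the fitted trend reaches `limit`, or None if not rising."""
    slope, intercept = linear_trend(times, values)
    if slope <= 0:
        return None  # no upward trend, so no forecast crossing
    return (limit - intercept) / slope


# Hypothetical vibration readings (mm/s) taken every 10 days on one bearing.
days = [0, 10, 20, 30, 40]
vibration = [2.0, 2.5, 3.0, 3.5, 4.0]

crossing = time_to_limit(days, vibration, limit=7.1)
print(f"trend reaches the alarm limit at about day {crossing:.0f}")
```

The forecast day gives the maintenance planner a window in which to schedule the work so that it does not interfere with production, which is the main advantage over a fixed-interval schedule.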
An example of predictive maintenance is knowing that the service life of a microwave is 5 years, and then replacing that microwave just before the 5 years are up, even if the microwave is still in working condition.

Proactive Maintenance

Proactive maintenance draws on both preventive and predictive maintenance. Proactive maintenance improves maintenance through better design, installation, maintenance procedures, workmanship and scheduling [14]. Proactive maintenance employs the following basic techniques to extend machinery life:

- Specification for new/rebuilt equipment
- Precision rebuild and installation
- Failed-part analysis (FPA)
- Root-cause failure analysis (RCFA)
- Reliability engineering
- Rebuild certification/verification
- Age exploration
- Recurrence control

Reliability centered maintenance (RCM)

Reliability centered maintenance combines all four maintenance methods mentioned earlier. RCM is an ongoing process which determines the optimum mix of reactive, preventive, predictive and proactive maintenance practices in order to provide the required reliability at the minimum cost [14].

Figure 7. Reliability centered maintenance and the items suited to each of its four components. Source: Reliability Strategy and Plan, www.utk.education

- Reactive maintenance: small items; non-critical; inconsequential; unlikely to fail; redundant
- Preventive maintenance: subject to wear-out; consumable replacement; failure pattern known
- Predictive maintenance: random failure patterns; not subject to wear; PM-induced failures
- Proactive maintenance: RCFA; FMEA; AE (age exploration)

RCM has its roots in the early 1960s, when it was first developed by the North American civil aviation industry. In the mid-1970s the US Department of Defense commissioned a report on the subject of RCM. This report, written by Stanley Nowlan and Howard Heap, is still used today and is considered one of the most important documents available on the subject. The RCM analysis asks the following questions:

- What does the system or equipment do?
- What functional failures are likely to occur?
- What are the likely consequences of these failures?
- What can be done to prevent these functional failures?

An RCM decision logic tree is then drawn up based on the answers to the above questions. The following is an example of an RCM decision logic tree.

Figure 8. RCM decision logic tree. Source: Reliability Strategy and Plan, www.utk.education

Decision questions (each answered Yes or No):

- Will the failure of the facility or equipment item have a direct and adverse effect on safety or critical mission operations?
- Is the item expendable?
- Can redesign solve the problem permanently and cost effectively?
- Is there predictive technology that will monitor the condition and give sufficient warning of an impending failure?
- Is PdM cost and priority-justified?
- Is there an effective PM task that will minimize functional failure?
- Is establishing redundancy cost and priority-justified?

Outcomes:

- Accept risk
- Redesign
- Install redundant unit(s)
- Install PM task and schedule
- Define PM task and schedule

For a successful implementation of RCM the following factors must be taken into consideration [14]:

- Clear project goals
- Management support and a commitment to introduce a controlled maintenance environment
- Union involvement
- Good understanding of RCM philosophy by plant staff
- Pilot RCM application to demonstrate success and build support
- Sufficient resources for both the review and subsequent implementation of recommendations
- Clear documentation of results to facilitate acceptance of recommendations
- Integration with PdM maintenance capability

Conclusion

Reliability is one of the most important concepts in any organization. From paper mills right through to airlines, reliability is a good representation of an organization's worth. The more reliable a system or organization is, the more value a customer will place on that organization. A reliable system will also save the company money and ensure that the reputation of the company is always shown in a positive light. Tools such as the functional FMEA and fault tree analysis are widely used to ensure an organization's reliability. Reliability centered maintenance (RCM) is another method to determine a system's reliability and save the company money. Reliability is not just expected by the customer; it is a necessity that can have a wide-ranging effect on the customer, the organization and the industry.

Bibliography

[1] E. Bauer, X. Zhang and D. Kimber, Practical system reliability, John Wiley & Sons, Inc., 2009.
[2] Wikipedia, "Reliability Engineering," [Online]. Available: http://en.wikipedia.org/wiki/Reliability_engineering.
[3] Wikipedia, "Wikipedia," [Online]. Available: http://en.wikipedia.org/wiki/Availability. [Accessed 4 September 2013].
[4] C. S. Alliance, "Cloud Security Alliance: Security guidance for critical areas of focus in cloud computing V2.1," Cloud Security Alliance, 2009. [Online]. Available: https://cloudsecurityalliance.org/csaguide.pdf. [Accessed 5 September 2013].
[5] ISO, ISO 9000:2005 Quality management systems -- Fundamentals and vocabulary, ISO, 2002.
[6] Wikipedia, "Wikipedia," [Online]. Available: http://en.wikipedia.org/wiki/Eight_dimensions_of_quality. [Accessed 5 September 2013].
[7] D. Needham, "BSI education," [Online]. Available: www.bsieducation.org/.../Lecture-Materials-Concept-of-Quality.doc. [Accessed 30 September 2013].
[8] J. Bentley, An introduction to reliability and quality engineering, Addison-Wesley Longman, Ltd, 1999.
[9] Wikipedia, "Wikipedia," [Online]. Available: http://en.wikipedia.org/wiki/System. [Accessed 6 September 2013].
[10] J. Sun, L. Xi, S. Du and B. Ju, "Reliability modeling and analysis of serial-parallel hybrid multi-operational manufacturing system considering dimensional quality, tool degradation and system configuration," International Journal of Production Economics, vol. 114, no. 1, pp. 149-164, 2008.
[11] Wikipedia, "Wikipedia," [Online]. Available: http://en.wikipedia.org/wiki/Active_redundancy.
[12] J. Andrews and T. Moss, Reliability and risk assessment, Longman Group UK Ltd., 1993.
[13] F. Ortmeier and G. Schellhorn, "Formal Fault Tree Analysis: Practical Experiences," Electronic Notes in Theoretical Computer Science, pp. 139-151, 2007.
[14] U. Education, "UTK Education," [Online]. Available: http://web.utk.edu/~kkirby/IE591/Module03.pdf. [Accessed 25 September 2013].
