Mirek Generowicz, I&E Systems Pty Ltd, Australia
Preventing preventable failures
The need for credible failure rates
IEC 61511 Edition 2 (2016) reduced the requirements for hardware fault tolerance. Instead, the new edition emphasises that the 'reliability data used in quantifying the effect of random failures should be credible, traceable, documented and justified, based on field feedback from similar devices used in a similar operating environment' (IEC 61511-1 §11.9.3).
In practice, how do we find 'credible reliability data'?
Answering this question revealed that most failure probability calculations are based on a misunderstanding of probability theory.
'Wrong, but useful'
Failure probability calculations depend on estimates of equipment failure rates. The basic assumption has been that during mid-life the failure rate is constant and fixed.
This assumption is WRONG:
• The rate is not fixed
• …but the calculations are still useful
We need a different emphasis
We cannot predict probability of failure with precision:
• Failure rates are not fixed and constant
• We can estimate PFD within an order of magnitude only
• We can set feasible targets for failure rates
• We must design the system so that the rates can be achieved
In operation we need to monitor the actual PFD achieved:
• We must measure actual failure rates and analyse all failures
• Manage performance to meet target failure rates by preventing preventable failures
Hardware fault tolerance
Calculations of failure probability are not enough because they depend on very coarse assumptions. So IEC 61511 also specifies minimum requirements for redundancy, i.e. 'hardware fault tolerance':
'to alleviate potential shortcomings in SIF design that may result due to the number of assumptions made in the design of the SIF, along with uncertainty in the failure rate of components or subsystems used in various process applications'
HFT in IEC 61511 Edition 2
New, simpler requirements for HFT:
SIL 3 is allowed with only two block valves (HFT = 1)
• Previously, SIL 3 with only two valves needed SFF > 60%, or 'dominant failure to the safe state' (difficult to achieve)
SIL 2 low demand needs only one block valve (HFT = 0)
• Previously, two valves were required for SIL 2
Based on the IEC 61508 'Route 2H' method:
• Introduced in 2010
• Requires a confidence level of ≥ 90% in target failure measures (only 70% is needed for Route 1H), e.g. λAVG = (number of failures recorded) / (total time in service)
IEC 61511 Ed 2 only requires a 70% confidence level instead of 90%, but similar to Route 2H it requires credible, traceable failure rate data based on field feedback from similar devices in a similar operating environment.
HFT in IEC 61511 Edition 2: why not 90%?
So what is confidence level?
The uncertainty in the rate of independent events depends on how many events are measured. Confidence level relates to the width of the uncertainty band, which depends on the number of failures recorded.
We can evaluate the uncertainty with the chi-squared function χ²:
α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period
Confidence level does not depend directly on population size.
λα = χ²(α, ν) / (2T)

λ90 compared with λ70:
λ90 = χ²(0.1, 2n + 2) / (2T)
Population                 100        100        100        100
Time in service (hours)    50,000     50,000     50,000     50,000
Device-hours               5,000,000  5,000,000  5,000,000  5,000,000
Number of failures         0          1          10         100
MTBF (hours)               -          5,000,000  500,000    50,000
λAVG per hour              -          2.0E-07    2.0E-06    2.0E-05
λ50 per hour               1.4E-07    3.4E-07    2.1E-06    2.0E-05
λ70 per hour               2.4E-07    4.9E-07    2.5E-06    2.1E-05
λ90 per hour               4.6E-07    7.8E-07    3.1E-06    2.3E-05
λ90 / λ50                  3.3        2.3        1.4        1.1
λ90 / λ70                  1.9        1.6        1.2        1.1
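These figures can be reproduced numerically. A minimal sketch in Python (illustrative only, stdlib only; because ν = 2(n + 1) is always even, the chi-squared CDF has a closed form and no statistics library is needed):

```python
import math

def chi2_cdf_even(x, nu):
    """CDF of the chi-squared distribution for even degrees of freedom
    nu = 2k, via the closed-form Poisson sum: P(X <= x) = 1 - e^(-x/2) * sum."""
    term = math.exp(-x / 2.0)   # Poisson pmf at i = 0
    total = term
    for i in range(1, nu // 2):
        term *= (x / 2.0) / i   # pmf(i) from pmf(i - 1); avoids overflow
        total += term
    return 1.0 - total

def chi2_quantile_even(p, nu):
    """Invert the CDF by bisection (ample precision for this purpose)."""
    lo, hi = 0.0, 10.0 * nu + 100.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def failure_rate(confidence, n_failures, device_hours):
    """Lambda at a given confidence level: chi2(confidence, 2(n + 1)) / (2T)."""
    nu = 2 * (n_failures + 1)
    return chi2_quantile_even(confidence, nu) / (2.0 * device_hours)

# The n = 10 column of the table: 100 devices x 50,000 h = 5,000,000 device-hours
T = 5_000_000.0
for p in (0.50, 0.70, 0.90):
    print(f"lambda{int(p * 100)} = {failure_rate(p, 10, T):.1e} per hour")
```

With 10 recorded failures this reproduces the 2.1E-06, 2.5E-06 and 3.1E-06 per hour values in the table above.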
Can we be confident?
Users have collected so much data over many years, in a variety of environments. With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years. Surely by now we must have recorded enough failures for most commonly applied devices:
λ90 ≈ λ70 ≈ λAVG
Where can we find credible data?
An enormous effort has been made to determine λ. Some widely used sources:
• OREDA 'Offshore and Onshore Reliability Data Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data
Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide: uncertainty intervals span 1 or 2 orders of magnitude.
Wide variation in reported λ
Why is the variation so wide if λ is constant?
OREDA includes all mid-life failures:
– only some are random
– some are systematic (depending on the application)
– site-specific factors influence failure rate
λ is NOT constant, so χ² confidence levels do not apply; OREDA provides lower and upper deciles to show uncertainty.
A long-standing debate
Should the calculated probability of failure take into account systematic failures?
The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.
ISA-TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.
The intent of the standards
Manage risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods
Reduce risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
– Apply diagnostics for fault detection and regular periodic testing
– Apply redundant equipment for fault tolerance
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match
If I record 47 heads in 100 tosses, P(head) ≈ 0.47.
If a football team wins 47 matches out of 100, what is the probability they will win their next match?
What is random?
Dictionary:
'Made, done, or happening without method or conscious decision; haphazard'
Mathematics:
A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.
These definitions are significantly different.
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.
Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age and wear related failures in the mid-life period)
• Human – operating under stress, non-routine: variable λ
Systematic (cannot be quantified by a fixed rate):
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates. ISO/TR 12489 §3.2.9, catalectic failure: sudden and complete failure.
Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5, random hardware failure: failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
Random hardware failure is not limited to 'catalectic' failure. It cannot be characterised by a fixed constant failure rate, though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but with a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed:

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t) ≈ λDU·t for small λDU·t
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially. SIFs always require PFDavg < 0.1, and in this region the accumulation is linear:
PFD ≈ λDU·t
With a proof test interval T, the average is simply:
PFDavg ≈ λDU·T / 2
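The exponential accumulation and its linear approximation can be checked with a few lines of arithmetic. A sketch, assuming the illustrative values λDU = 0.02 per annum and a one-year proof-test interval:

```python
import math

# Illustrative values: lambda_DU = 0.02 failures per annum, proof-test interval 1 year
lam_du = 0.02 / 8760.0   # converted to failures per hour
T = 8760.0               # proof-test interval in hours

def pfd(t, lam):
    """Probability that a device picked at random has failed undetected by time t."""
    return 1.0 - math.exp(-lam * t)

# Exact time-average of PFD over the proof-test interval
pfd_avg_exact = 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)

# Linear approximation, valid while lambda_DU * T << 1 (i.e. PFDavg < 0.1)
pfd_avg_approx = lam_du * T / 2.0

print(pfd_avg_exact, pfd_avg_approx)   # ~0.00993 vs 0.01: the approximation holds
```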
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements. With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause. Common cause failure always strongly dominates PFD. The PFD depends equally on λDU, β and T.
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude, so the scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be in orders of magnitude).
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12% – 15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application
Batch process: T < 0.1 years
– Results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
– Low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
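A rough order-of-magnitude check of these claims, using the typical values quoted above (the SIL banding function follows the low-demand PFDavg bands in IEC 61511; all values are illustrative):

```python
# Typical values from the text: beta ~ 10%, lambda_DU ~ 0.02 pa, T ~ 1 y
beta, lam_du, T = 0.10, 0.02, 1.0

pfd_1oo1 = lam_du * T / 2.0   # single final element (HFT = 0)
pfd_1oo2 = (1.0 - beta) * (lam_du * T) ** 2 / 3.0 + beta * lam_du * T / 2.0

def sil_band(pfd_avg):
    """Low-demand SIL band for a given PFDavg (order-of-magnitude bands)."""
    for sil, lower in ((4, 1e-5), (3, 1e-4), (2, 1e-3), (1, 1e-2)):
        if lower <= pfd_avg < 10.0 * lower:
            return sil
    return 0

# 1oo1 gives PFDavg = 0.01, right at the SIL 1 / SIL 2 boundary;
# 1oo2 gives ~1.1e-3, comfortably SIL 2 but short of SIL 3 -
# consistent with the need to reduce lambda_DU, beta and/or T for SIL 3.
print(pfd_1oo1, sil_band(pfd_1oo1))
print(pfd_1oo2, sil_band(pfd_1oo2))
```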
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate actual λDU and compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
– preventing systematic failures
• Volume of operating experience provides evidence that:
– systematic capability is adequate
– inspection and maintenance is effective
– failures are monitored, analysed and understood
– failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates It draws conclusions on how failure rates and failure probability can be controlled in practice
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications The variation depends largely on the service operating environment and maintenance practices It is clear that
bull Failures are almost never purely random and as a result
bull Failure rates are not fixed and constant
At best the risk reduction achieved by these safety functions can be estimated only within an order of magnitude Nevertheless even with such imprecise results the calculations are still useful
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions This is important in setting operational reliability benchmarks
Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices
Most failures in safety function components (including software) are predictable preventable or avoidable to some degree suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore safety function reliability performance can be improved through four key strategies:
1 Eliminating systematic and common-cause failures throughout the design development and implementation processes and throughout operation and maintenance practices
2 Designing the equipment to allow access to enable sufficiently frequent inspection testing and maintenance and to enable suitable test coverage
3 Using risk-based inspection and condition-based maintenance techniques to
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4 Detailed analysis of all failures to
- Monitor failure rates
- Determine how failure recurrence may be prevented if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction Functional safety maintains safety integrity of assets in two ways
Systematic safety integrity deals with preventable failures These are failures resulting from errors and shortcomings in the design manufacture installation operation maintenance and modification of the safeguarding systems
Hardware safety integrity deals with controlling random hardware failures These are the failures that occur at a reasonably constant rate and are completely independent of each other They are not preventable and cannot be avoided or eliminated but the probability of these failures occurring (and the resulting risk reduction) can be calculated
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components then undetected failures will accumulate exponentially The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability For this reason the standards specify a minimum level of fault tolerance (ie redundancy) in the architectural design of the safety functions The required level of redundancy increases with the risk reduction required
Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions Redundancy also increases the likelihood of inadvertent or spurious action which in itself may lead to increased risk of hazards
IEC 61508 provides two strategies for minimising the required hardware fault tolerance
• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
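The effect of the number of recorded failures on the λ90/λ70 ratio can be verified numerically. A sketch (illustrative, stdlib only; the chi-squared quantile is found by bisection on the closed-form CDF for the even degrees of freedom 2(n + 1)):

```python
import math

def chi2_quantile_even(p, nu):
    """Quantile of the chi-squared distribution with even degrees of freedom nu,
    via bisection on the closed-form (Poisson sum) CDF."""
    def cdf(x):
        term = math.exp(-x / 2.0)   # Poisson pmf at i = 0
        total = term
        for i in range(1, nu // 2):
            term *= (x / 2.0) / i
            total += term
        return 1.0 - total
    lo, hi = 0.0, 10.0 * nu + 100.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# The ratio depends only on the number of recorded failures n (dof = 2(n + 1))
ratios = {}
for n in (1, 3, 10, 100):
    nu = 2 * (n + 1)
    ratios[n] = chi2_quantile_even(0.90, nu) / chi2_quantile_even(0.70, nu)
    print(n, round(ratios[n], 2))   # falls from ~1.6 at n = 1 towards ~1.1 at n = 100
```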
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment In general the data exclude infant mortality failures and end-of-life failures
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications Some users consistently achieve failure rates at least 10 times lower than other users The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design operation and maintenance
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a lsquorandom processrsquo or lsquostochastic processrsquo In this context the usage of the word random is narrower For the mathematical analysis these standards all assume that
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'
The key difference is the requirement for mutual independence Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random but the causes are usually well known and understood and are partially deterministic The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed
Degradation of mechanical components is not a purely random process, because degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation it is simply not practicable to prevent all failures Inspection and maintenance can never be perfect It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance overhaul and renewal They are characterised by a constant failure rate even though that rate is not fixed
The reasons for the wide variability in reported failure rates seem to be clear
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor, β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events These failures are largely systematic and preventable
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements, because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)(λDU T)²/3 + β λDU T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
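The two terms of the approximation above can be evaluated as a minimal sketch, using the example figures from the text (λDU = 0.02 pa, T = 1 year) and a β of 12% from the typical range quoted; the function name is an illustrative assumption, not part of the paper:

```python
# Sketch of the 1oo2 PFD approximation from the text (not a substitute
# for a full IEC 61511 verification calculation).
def pfd_1oo2(lam_du, T, beta):
    """Average PFD of a 1oo2 final-element pair.

    lam_du : dangerous undetected failure rate (per annum)
    T      : proof-test interval (years)
    beta   : common cause fraction (0..1)
    Returns (independent term, common cause term, total).
    """
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    return independent, common_cause, independent + common_cause

# Example: lambda_DU = 0.02 pa, T = 1 year, beta = 12 %.
ind, ccf, total = pfd_1oo2(0.02, 1.0, 0.12)
# The common cause term (1.2e-3) is about ten times the
# independent term (1.2e-4), so it dominates the total PFD.
```

With these inputs the total PFD is about 1.3e-3, illustrating why reducing β matters more than adding further redundancy.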
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful.'
The quote is in the context of a discussion of the different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice.
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
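The order-of-magnitude categorisation described above can be sketched as a simple lookup against the standard IEC 61511 low-demand PFDavg bands; the function name is an illustrative assumption:

```python
# Sketch: categorise a calculated average PFD by the IEC 61511
# low-demand SIL bands (each band spans one order of magnitude).
def sil_band(pfd_avg):
    if 1e-5 <= pfd_avg < 1e-4:
        return 4
    if 1e-4 <= pfd_avg < 1e-3:
        return 3
    if 1e-3 <= pfd_avg < 1e-2:
        return 2
    if 1e-2 <= pfd_avg < 1e-1:
        return 1
    return None  # outside the SIL 1..4 low-demand bands

# A calculated PFD of 1.12e-3 is only an order-of-magnitude estimate,
# so report an RRF of roughly 900 (one significant figure) and SIL 2.
assert sil_band(1.12e-3) == 2
```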
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction factor of 1,000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
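The benchmark comparison suggested by the SINTEF guidance might be sketched as follows, assuming a simple Poisson model for the number of failures; the function names, fleet size and observation period are hypothetical illustrations, not taken from the report:

```python
import math

def expected_failures(lam_du_pa, n_devices, years):
    # Target (expected) number of failures, from the design failure rate.
    return lam_du_pa * n_devices * years

def exceeds_target(observed, expected, alpha=0.05):
    """One-sided Poisson check: is seeing `observed` or more failures
    unlikely (p < alpha) if the true mean were `expected`?"""
    # P(N >= observed) = 1 - sum_{k < observed} e^-m m^k / k!
    p_less = sum(math.exp(-expected) * expected**k / math.factorial(k)
                 for k in range(observed))
    return (1.0 - p_less) < alpha

# Hypothetical fleet: 50 valves, target lambda_DU 0.02 pa, 2 years in service.
m = expected_failures(0.02, 50, 2)  # 2.0 expected failures
# 7 observed failures against a target of 2 would warrant cause analysis;
# 3 observed failures would not, on this test alone.
```

A result flagged by such a check is the trigger for the cause analysis and compensating measures described above, not a conclusion in itself.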
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
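Even when no failures have been observed, an upper confidence bound on the rate can still be stated. A minimal sketch, assuming a constant-rate (exponential/Poisson) model and a hypothetical fleet; the function name is illustrative:

```python
import math

# With zero failures in t device-hours, the one-sided upper confidence
# bound at confidence c is lambda_up = -ln(1 - c) / t
# (from the constant-failure-rate model).
def lambda_upper_zero_failures(device_hours, confidence=0.90):
    return -math.log(1.0 - confidence) / device_hours

# Hypothetical: 20 valves observed for 5 years (876,000 device-hours)
# with no dangerous undetected failures.
lam_90 = lambda_upper_zero_failures(20 * 5 * 8760)
# lam_90 is about 2.6e-6 per hour, i.e. roughly 0.023 per annum,
# barely enough data to support a claimed lambda_DU of 0.02 pa.
```

This is why, for small populations, condition monitoring and FMEDA-based leading indicators carry more weight than the measured rate itself.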
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and MTTR to be minimised in operation.
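The proportionality noted above can be illustrated with a small calculation; the intervention frequency and the 72-hour restoration time are hypothetical values, chosen only to show the scale of the effect:

```python
# Sketch of the out-of-service contribution: a bypassed safety function
# provides no risk reduction, so its unavailability adds directly to PFD
# in proportion to the fraction of time it is out of service.
def pfd_out_of_service(mttr_hours, repairs_per_year=1.0, hours_per_year=8760.0):
    return repairs_per_year * mttr_hours / hours_per_year

# Hypothetical: one corrective intervention per year, taking 72 hours
# with the function bypassed, contributes about 8.2e-3 to the PFD.
# That alone consumes most of a SIL 2 budget (PFD < 1e-2).
contribution = pfd_out_of_service(72.0)
```

Halving the MTTR halves this contribution, which is why accessibility for maintenance is a design parameter, not an operational afterthought.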
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to the management of the 'random' failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude; the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable, provided the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016.
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007.
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010.
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010.
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489.
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015.
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013.
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013.
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008.
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015.
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015.
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015.
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990.
Min | Max | 30 | steps | step size | bottom bin | mean | Stddev | ||||||||||||||||||||
Number | inverse | - 0 | 6000 | 6000 | 200 | 200 | $000 | 30 | 12 | ||||||||||||||||||
1 | 10E+00 | assume min-max is 5 std dev | |||||||||||||||||||||||||
2 | 50E-01 | $000 | |||||||||||||||||||||||||
3 | 33E-01 | Bin Bottom | Bin Label | No of Failures | Normalised | ||||||||||||||||||||||
4 | 25E-01 | - 0 | - 0 | 0 | 00014606917 | ||||||||||||||||||||||
5 | 20E-01 | 2 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
6 | 17E-01 | 4 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
7 | 14E-01 | 6 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
8 | 13E-01 | 8 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
9 | 11E-01 | 10 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
10 | 10E-01 | 12 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
11 | 91E-02 | 14 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
12 | 83E-02 | 16 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
13 | 77E-02 | 18 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
14 | 71E-02 | 20 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
15 | 67E-02 | 22 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
16 | 63E-02 | 24 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
17 | 59E-02 | 26 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
18 | 56E-02 | 28 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
19 | 53E-02 | 30 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
20 | 50E-02 | 32 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
21 | 48E-02 | 34 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
22 | 45E-02 | 36 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
23 | 43E-02 | 38 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
24 | 42E-02 | 40 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
25 | 40E-02 | 42 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
26 | 38E-02 | 44 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
27 | 37E-02 | 46 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
28 | 36E-02 | 48 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
29 | 34E-02 | 50 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
30 | 33E-02 | 52 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
31 | 32E-02 | ||||||||||||||||||||||||||
32 | 31E-02 | ||||||||||||||||||||||||||
33 | 30E-02 | ||||||||||||||||||||||||||
34 | 29E-02 | ||||||||||||||||||||||||||
35 | 29E-02 | ||||||||||||||||||||||||||
36 | 28E-02 | ||||||||||||||||||||||||||
37 | 27E-02 | ||||||||||||||||||||||||||
38 | 26E-02 | ||||||||||||||||||||||||||
39 | 26E-02 | ||||||||||||||||||||||||||
40 | 25E-02 | ||||||||||||||||||||||||||
41 | 24E-02 | ||||||||||||||||||||||||||
42 | 24E-02 | ||||||||||||||||||||||||||
43 | 23E-02 | ||||||||||||||||||||||||||
44 | 23E-02 | ||||||||||||||||||||||||||
45 | 22E-02 | ||||||||||||||||||||||||||
46 | 22E-02 | ||||||||||||||||||||||||||
47 | 21E-02 | ||||||||||||||||||||||||||
48 | 21E-02 | ||||||||||||||||||||||||||
49 | 20E-02 | ||||||||||||||||||||||||||
50 | 20E-02 | ||||||||||||||||||||||||||
51 | 20E-02 | ||||||||||||||||||||||||||
52 | 19E-02 | ||||||||||||||||||||||||||
53 | 19E-02 | ||||||||||||||||||||||||||
54 | 19E-02 | ||||||||||||||||||||||||||
55 | 18E-02 | ||||||||||||||||||||||||||
56 | 18E-02 | ||||||||||||||||||||||||||
57 | 18E-02 | ||||||||||||||||||||||||||
58 | 17E-02 | ||||||||||||||||||||||||||
59 | 17E-02 | ||||||||||||||||||||||||||
Average | 30 | hours | 790E-02 | ||||||||||||||||||||||||
33E-02 | |||||||||||||||||||||||||||
Period | - 0 | days | |||||||||||||||||||||||||
- 0 | hours | ||||||||||||||||||||||||||
Population | 1000 | devices | |||||||||||||||||||||||||
Time in service | - 0 | Device-hours | |||||||||||||||||||||||||
Number of failures | - 0 | ||||||||||||||||||||||||||
MTBF | ERRORDIV0 | hours | ERRORDIV0 | years | |||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | |||||||||||||||||||||||||
Confidence level | |||||||||||||||||||||||||||
l90 | ERRORDIV0 | per hour | 90 | ||||||||||||||||||||||||
Check mean failure rate using χsup2 | |||||||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | 50 | ||||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | |||||||||||||||||||||||||||
MTBF | 30 | hours | |||||||||||||||||||||||||
Standard deviation | 17 | hours | |||||||||||||||||||||||||
13s | 22 | hours | |||||||||||||||||||||||||
MTBF-13s | 8 | hours | |||||||||||||||||||||||||
l90 | 13E-01 | per hour | |||||||||||||||||||||||||
Mean | 33E-02 | per hour | |||||||||||||||||||||||||
SD | 58E-02 | per hour | |||||||||||||||||||||||||
13SD | 76E-02 | per hour | |||||||||||||||||||||||||
Mean + 13SD | 11E-01 | per hour | |||||||||||||||||||||||||
Time between failures | 25 | steps | step size | bottom bin | mean | Stddev | ||||||||||||||||||||||||
Start | 1110 | device-hours | - 0 | - 0 | 000 | $000 | 536133 | - 0 | ||||||||||||||||||||||
Failure | 1510 | 962281 | 10E-06 | ERRORDIV0 | 96E+05 | ERRORDIV0 | 10E-06 | ERRORDIV0 | assume min-max is 5 std dev | |||||||||||||||||||||
Failure | 1710 | 648197 | 15E-06 | ERRORDIV0 | 65E+05 | ERRORDIV0 | 15E-06 | ERRORDIV0 | ERRORDIV0 | |||||||||||||||||||||
Failure | 11110 | 814662 | 12E-06 | ERRORDIV0 | 81E+05 | ERRORDIV0 | 12E-06 | ERRORDIV0 | Bin Label | No of Failures | Normalised | |||||||||||||||||||
Failure | 11110 | 141452 | 71E-06 | ERRORDIV0 | 14E+05 | ERRORDIV0 | 71E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 11510 | 1004475 | 10E-06 | ERRORDIV0 | 10E+06 | ERRORDIV0 | 10E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 11810 | 622443 | 16E-06 | ERRORDIV0 | 62E+05 | ERRORDIV0 | 16E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 11910 | 325476 | 31E-06 | ERRORDIV0 | 33E+05 | ERRORDIV0 | 31E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 12010 | 222575 | 45E-06 | ERRORDIV0 | 22E+05 | ERRORDIV0 | 45E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 12210 | 508397 | 20E-06 | ERRORDIV0 | 51E+05 | ERRORDIV0 | 20E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 12510 | 571250 | 18E-06 | ERRORDIV0 | 57E+05 | ERRORDIV0 | 18E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 12610 | 394182 | 25E-06 | ERRORDIV0 | 39E+05 | ERRORDIV0 | 25E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 12910 | 644171 | 16E-06 | ERRORDIV0 | 64E+05 | ERRORDIV0 | 16E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 13010 | 124490 | 80E-06 | ERRORDIV0 | 12E+05 | ERRORDIV0 | 80E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 13110 | 425465 | 24E-06 | ERRORDIV0 | 43E+05 | ERRORDIV0 | 24E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 2410 | 954421 | 10E-06 | ERRORDIV0 | 95E+05 | ERRORDIV0 | 10E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 2610 | 492722 | 20E-06 | ERRORDIV0 | 49E+05 | ERRORDIV0 | 20E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 2810 | 482169 | 21E-06 | ERRORDIV0 | 48E+05 | ERRORDIV0 | 21E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 2910 | 45598 | 22E-05 | ERRORDIV0 | 46E+04 | ERRORDIV0 | 22E-05 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 2910 | 174599 | 57E-06 | ERRORDIV0 | 17E+05 | ERRORDIV0 | 57E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 21010 | 153784 | 65E-06 | ERRORDIV0 | 15E+05 | ERRORDIV0 | 65E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 21010 | 75110 | 13E-05 | ERRORDIV0 | 75E+04 | ERRORDIV0 | 13E-05 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 21410 | 912650 | 11E-06 | ERRORDIV0 | 91E+05 | ERRORDIV0 | 11E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 21710 | 664521 | 15E-06 | ERRORDIV0 | 66E+05 | ERRORDIV0 | 15E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 21710 | 37380 | 27E-05 | ERRORDIV0 | 37E+04 | ERRORDIV0 | 27E-05 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 22010 | 759032 | 13E-06 | ERRORDIV0 | 76E+05 | ERRORDIV0 | 13E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 22210 | 534288 | 19E-06 | ERRORDIV0 | 53E+05 | ERRORDIV0 | 19E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 22510 | 615725 | 16E-06 | ERRORDIV0 | 62E+05 | ERRORDIV0 | 16E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 3110 | 949343 | 11E-06 | ERRORDIV0 | 95E+05 | ERRORDIV0 | 11E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 3510 | 927866 | 11E-06 | ERRORDIV0 | 93E+05 | ERRORDIV0 | 11E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 3610 | 225532 | 44E-06 | ERRORDIV0 | 23E+05 | ERRORDIV0 | 44E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 3710 | 391125 | 26E-06 | ERRORDIV0 | 39E+05 | ERRORDIV0 | 26E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 3810 | 271703 | 37E-06 | ERRORDIV0 | 27E+05 | ERRORDIV0 | 37E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 31010 | 463530 | 22E-06 | ERRORDIV0 | 46E+05 | ERRORDIV0 | 22E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 31310 | 691314 | 14E-06 | ERRORDIV0 | 69E+05 | ERRORDIV0 | 14E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 31410 | 273336 | 37E-06 | ERRORDIV0 | 27E+05 | ERRORDIV0 | 37E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 31810 | 772942 | 13E-06 | ERRORDIV0 | 77E+05 | ERRORDIV0 | 13E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 32210 | 972713 | 10E-06 | ERRORDIV0 | 97E+05 | ERRORDIV0 | 10E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 32510 | 850558 | 12E-06 | ERRORDIV0 | 85E+05 | ERRORDIV0 | 12E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 32710 | 528257 | 19E-06 | ERRORDIV0 | 53E+05 | ERRORDIV0 | 19E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 4110 | 987792 | 10E-06 | ERRORDIV0 | 99E+05 | ERRORDIV0 | 10E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 4310 | 590263 | 17E-06 | ERRORDIV0 | 59E+05 | ERRORDIV0 | 17E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 4510 | 385461 | 26E-06 | ERRORDIV0 | 39E+05 | ERRORDIV0 | 26E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 4610 | 224526 | 45E-06 | ERRORDIV0 | 22E+05 | ERRORDIV0 | 45E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 4810 | 679543 | 15E-06 | ERRORDIV0 | 68E+05 | ERRORDIV0 | 15E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 41210 | 845334 | 12E-06 | ERRORDIV0 | 85E+05 | ERRORDIV0 | 12E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 41210 | 24362 | 41E-05 | ERRORDIV0 | 24E+04 | ERRORDIV0 | 41E-05 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 41510 | 621767 | 16E-06 | ERRORDIV0 | 62E+05 | ERRORDIV0 | 16E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 41810 | 788953 | 13E-06 | ERRORDIV0 | 79E+05 | ERRORDIV0 | 13E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 41910 | 366532 | 27E-06 | ERRORDIV0 | 37E+05 | ERRORDIV0 | 27E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 42310 | 780944 | 13E-06 | ERRORDIV0 | 78E+05 | ERRORDIV0 | 13E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 42410 | 251133 | 40E-06 | ERRORDIV0 | 25E+05 | ERRORDIV0 | 40E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 42710 | 786268 | 13E-06 | ERRORDIV0 | 79E+05 | ERRORDIV0 | 13E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 42810 | 207976 | 48E-06 | ERRORDIV0 | 21E+05 | ERRORDIV0 | 48E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 5210 | 985398 | 10E-06 | ERRORDIV0 | 99E+05 | ERRORDIV0 | 10E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 5510 | 698150 | 14E-06 | ERRORDIV0 | 70E+05 | ERRORDIV0 | 14E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 5910 | 874164 | 11E-06 | ERRORDIV0 | 87E+05 | ERRORDIV0 | 11E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 51110 | 499082 | 20E-06 | ERRORDIV0 | 50E+05 | ERRORDIV0 | 20E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 51110 | 101503 | 99E-06 | ERRORDIV0 | 10E+05 | ERRORDIV0 | 99E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 51210 | 302989 | 33E-06 | ERRORDIV0 | 30E+05 | ERRORDIV0 | 33E-06 | ERRORDIV0 | ||||||||||||||||||||||
End | 93015 | Average | 536133 | hours | 408E-06 | ERRORDIV0 | 536E+05 | ERRORDIV0 | 408E-06 | ERRORDIV0 | ||||||||||||||||||||
19E-06 | per hour | |||||||||||||||||||||||||||||
Period | 2098 | days | ||||||||||||||||||||||||||||
50352 | hours | |||||||||||||||||||||||||||||
Population | 10000 | devices | ||||||||||||||||||||||||||||
Time in service | 503520000 | Device-hours | ||||||||||||||||||||||||||||
Number of failures | 1000 | 59 | 59 | 59 | 59 | 59 | 59 | |||||||||||||||||||||||
MTBF | 503520 | hours | 57 | years | ||||||||||||||||||||||||||
l50 | 20E-06 | per hour | ||||||||||||||||||||||||||||
Confidence level | ||||||||||||||||||||||||||||||
l90 | 21E-06 | per hour | 90 | 10410605082 | ERRORVALUE | 18 | ERRORDIV0 | ERRORDIV0 | ERRORVALUE | |||||||||||||||||||||
Check mean failure rate using χsup2 | ||||||||||||||||||||||||||||||
l50 | 20E-06 | per hour | 50 | |||||||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | ||||||||||||||||||||||||||||||
MTBF | 536133 | hours | ||||||||||||||||||||||||||||
Standard deviation | 294906 | hours | ||||||||||||||||||||||||||||
13s | 383378 | hours | ||||||||||||||||||||||||||||
MTBF-13s | 152756 | hours | ||||||||||||||||||||||||||||
l90 | 65E-06 | per hour | ||||||||||||||||||||||||||||
Mean | 19E-06 | per hour | ||||||||||||||||||||||||||||
SD | 34E-06 | per hour | ||||||||||||||||||||||||||||
13SD | 44E-06 | per hour | ||||||||||||||||||||||||||||
Mean + 13SD | 63E-06 | per hour | ||||||||||||||||||||||||||||
Population | 100 | 100 | 100 | 100 | 10 | 10 | 10 | 10 | 100 | 100 | 100 | 100 | ||||||||||||||||||
Time in service hours | 50000 | 50000 | 50000 | 50000 | 1000 | 1000 | 1000 | 1000 | 10000 | 10000 | 10000 | 10000 | ||||||||||||||||||
Device-hours | 5000000 | 5000000 | 5000000 | 5000000 | 10000 | 10000 | 10000 | 10000 | 1000000 | 1000000 | 1000000 | 1000000 | ||||||||||||||||||
Number of failures | - 0 | 1 | 10 | 100 | 1 | 10 | 100 | 4 | - 0 | 1 | 10 | 100 | ||||||||||||||||||
MTBF hours | - 0 | 5000000 | 500000 | 50000 | 10000 | 1000 | 100 | 2500 | - 0 | 1000000 | 100000 | 10000 | ||||||||||||||||||
lAVG per hour | - 0 | 20E-07 | 20E-06 | 20E-05 | 10E-04 | 10E-03 | 10E-02 | 40E-04 | - 0 | 10E-06 | 10E-05 | 10E-04 | ||||||||||||||||||
05 | l50 per hour | 14E-07 | 34E-07 | 21E-06 | 20E-05 | 17E-04 | 11E-03 | 10E-02 | 47E-04 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | |||||||||||||||||
07 | l70 per hour | 24E-07 | 49E-07 | 25E-06 | 21E-05 | 24E-04 | 12E-03 | 11E-02 | 59E-04 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | |||||||||||||||||
09 | l90 per hour | 46E-07 | 78E-07 | 31E-06 | 23E-05 | 39E-04 | 15E-03 | 11E-02 | 80E-04 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | |||||||||||||||||
l90 l50 = | 33 | 23 | 14 | 11 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | ||||||||||||||||||||||
l90 l70 = | 19 | 16 | 12 | 11 | 16 | 12 | 11 | 14 | - 0 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ||||||||||||||||||
48784 | 249390 | 2120378 |
Time between failures | Min | Max | 25 | steps | step size | bottom bin | mean | Stddev | ||||||||||||||||||||
Start | 1110 | device-hours | 526 | 665 | 139 | 006 | 006 | $550 | 6 | 0 | ||||||||||||||||||
Failure | 5310 | 2949106 | 34E-07 | 64696904217 | -64696904217 | assume min-max is 5 std dev | ||||||||||||||||||||||
Failure | 82110 | 2632591 | 38E-07 | 64203833357 | -64203833357 | $000 | ||||||||||||||||||||||
Failure | 113010 | 2428252 | 41E-07 | 63852938381 | -63852938381 | Bin Bottom | Bin Label | No of Failures | Normalised | |||||||||||||||||||
Failure | 42711 | 3550948 | 28E-07 | 65503443066 | -65503443066 | 5 | 518 | 0 | 00003321716 | |||||||||||||||||||
Failure | 72111 | 2029839 | 49E-07 | 63074616502 | -63074616502 | 5 | 524 | 1 | 00007853508 | |||||||||||||||||||
Failure | 121411 | 3514339 | 28E-07 | 65458436058 | -65458436058 | 5 | 530 | 0 | 00017721668 | |||||||||||||||||||
Failure | 2312 | 1224713 | 82E-07 | 60880342113 | -60880342113 | 5 | 536 | 0 | 00038166742 | |||||||||||||||||||
Failure | 5412 | 2185019 | 46E-07 | 63394551674 | -63394551674 | 5 | 542 | 0 | 00078452211 | |||||||||||||||||||
Failure | 82012 | 2589443 | 39E-07 | 64132063897 | -64132063897 | 6 | 548 | 0 | 00153909314 | |||||||||||||||||||
Failure | 10812 | 1173440 | 85E-07 | 60694607774 | -60694607774 | 6 | 554 | 0 | 00288180257 | |||||||||||||||||||
Failure | 101612 | 182448 | 55E-06 | 52611385494 | -52611385494 | 6 | 560 | 0 | 00514995168 | |||||||||||||||||||
Failure | 42013 | 4467842 | 22E-07 | 66500977884 | -66500977884 | 6 | 566 | 0 | 00878378494 | |||||||||||||||||||
Failure | 9613 | 3344062 | 30E-07 | 65242743034 | -65242743034 | 6 | 572 | 0 | 01429880827 | |||||||||||||||||||
Failure | 1314 | 2843747 | 35E-07 | 64538909553 | -64538909553 | 6 | 578 | 0 | 02221557716 | |||||||||||||||||||
Failure | 41514 | 2461238 | 41E-07 | 63911536882 | -63911536882 | 6 | 584 | 0 | 03294237965 | |||||||||||||||||||
Failure | 91814 | 3746271 | 27E-07 | 65735991807 | -65735991807 | 6 | 590 | 0 | 04662211211 | |||||||||||||||||||
Failure | 12514 | 1875103 | 53E-07 | 6273025093 | -6273025093 | 6 | 596 | 1 | 06297505128 | |||||||||||||||||||
Failure | 5415 | 3597103 | 28E-07 | 65559528244 | -65559528244 | 6 | 602 | 0 | 08118666926 | |||||||||||||||||||
Failure | 62915 | 1342319 | 74E-07 | 6127855577 | -6127855577 | 6 | 608 | 2 | 0998942587 | |||||||||||||||||||
Failure | 8415 | 858503 | 12E-06 | 59337415921 | -59337415921 | 6 | 614 | 1 | 11731024489 | |||||||||||||||||||
End | 93015 | Average | 2449816 | hours | 723E-07 | avg | 63 | - 63 | 6 | 620 | 0 | 13148341142 | ||||||||||||||||
6 | 626 | 1 | 14065189732 | |||||||||||||||||||||||||
41E-07 | per hour | sd | 031368536 | 031368536 | 6 | 632 | 2 | 14360178406 | ||||||||||||||||||||
avg+13SD | 67 | -67244861308 | 6 | 638 | 2 | 13993091861 | ||||||||||||||||||||||
Period | 2098 | days | 6 | 644 | 3 | 13013890374 | ||||||||||||||||||||||
50352 | hours | MTBF90 | 5302567 | 5302567 | 7 | 650 | 2 | 11551548664 | ||||||||||||||||||||
Lambda90 | 189E-07 | 189E-07 | 7 | 656 | 4 | 09786173006 | ||||||||||||||||||||||
Population | 1000 | devices | 7 | 662 | 0 | 07912708659 | ||||||||||||||||||||||
Time in service | 50352000 | Device-hours | 189E-07 | 00000001886 | 7 | 668 | 1 | 06106285011 | ||||||||||||||||||||
Number of failures | 20 | 7 | 674 | 0 | 04497473114 | |||||||||||||||||||||||
MTBF | 2517600 | hours | ||||||||||||||||||||||||||
l50 | 40E-07 | per hour | ||||||||||||||||||||||||||
-67389046286 | ||||||||||||||||||||||||||||
-6718562883 | ||||||||||||||||||||||||||||
-67704516495 | ||||||||||||||||||||||||||||
-69038810225 | ||||||||||||||||||||||||||||
-68746281123 | ||||||||||||||||||||||||||||
Confidence level | -67719319389 | |||||||||||||||||||||||||||
l90 | 54E-07 | per hour | 90 | -62699281143 | -66371588608 | |||||||||||||||||||||||
Check mean failure rate using χsup2 | -66979903128 | |||||||||||||||||||||||||||
l50 | 41E-07 | per hour | 50 | 005 | ||||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | ||||||||||||||||||||||||||||
MTBF | 2449816 | hours | ||||||||||||||||||||||||||
Standard deviation | 1106903 | hours | ||||||||||||||||||||||||||
13s | 1438973 | hours | ||||||||||||||||||||||||||
MTBF-13s | 1010843 | hours | ||||||||||||||||||||||||||
l90 | 99E-07 | per hour | ||||||||||||||||||||||||||
Mean | 41E-07 | per hour | ||||||||||||||||||||||||||
SD | 90E-07 | per hour | ||||||||||||||||||||||||||
13SD | 12E-06 | per hour | ||||||||||||||||||||||||||
Mean + 13SD | 16E-06 | per hour | ||||||||||||||||||||||||||
Worked example: 100 devices in service from 1/1/10 to 9/30/15
Period: 2,098 days = 50,352 hours
Time in service: 5,035,200 device-hours
Number of failures: 3 (recorded 9/30/11, 6/3/13, 11/4/14)
MTBF: 1,678,400 hours
λ50 (= number of failures ÷ time in service): 6.0E-07 per hour
λ90 (90% confidence, χ²): 1.3E-06 per hour
Check mean failure rate using χ²: λ50 = 7.3E-07 per hour (50% confidence)
The need for credible failure rates
IEC 61511 Edition 2 (2016) reduced the requirements for hardware fault tolerance. Instead, the new edition emphasises that the 'reliability data used in quantifying the effect of random failures should be credible, traceable, documented and justified, based on field feedback from similar devices used in a similar operating environment' (IEC 61511-1 §11.9.3).
In practice, how do we find 'credible reliability data'?
Answering this question revealed that most failure probability calculations are based on a misunderstanding of probability theory.
'Wrong, but useful'
Failure probability calculations depend on estimates of equipment failure rates. The basic assumption has been that during mid-life the failure rate is constant and fixed.
This assumption is WRONG:
• The rate is not fixed
• …but the calculations are still useful
We need a different emphasis
We cannot predict the probability of failure with precision:
• Failure rates are not fixed and constant
• We can estimate PFD within an order of magnitude only
• We can set feasible targets for failure rates
• We must design the system so that the rates can be achieved
In operation we need to monitor the actual PFD achieved:
• We must measure actual failure rates and analyse all failures
• Manage performance to meet target failure rates
…by preventing preventable failures.
Hardware fault tolerance
Calculations of failure probability are not enough, because they depend on very coarse assumptions. So IEC 61511 also specifies minimum requirements for redundancy, i.e. 'hardware fault tolerance' (HFT):
'to alleviate potential shortcomings in SIF design that may result due to the number of assumptions made in the design of the SIF, along with uncertainty in the failure rate of components or subsystems used in various process applications'.
HFT in IEC 61511 Edition 2
A new, simpler requirement for HFT:
SIL 3 is allowed with only two block valves (HFT = 1)
• Previously SIL 3 with only two valves needed SFF > 60%, or 'dominant failure to the safe state' (difficult to achieve)
SIL 2 low demand needs only one block valve (HFT = 0)
• Previously two valves were required for SIL 2
Based on the IEC 61508 'Route 2H' method:
• Introduced in 2010
• Requires a confidence level of ≥ 90% in target failure measures (only 70% is needed for Route 1H), e.g. λAVG = number of failures recorded ÷ total time in service
IEC 61511 Ed 2 only requires a 70% confidence level instead of 90%, but, like Route 2H, it requires credible, traceable failure rate data based on field feedback from similar devices in a similar operating environment.
HFT in IEC 61511 Edition 2
Why not 90%?
So what is confidence level?
The uncertainty in the rate of independent events depends on how many events are measured. Confidence level relates to the width of the uncertainty band, which depends on the number of failures recorded.
We can evaluate the uncertainty with the chi-squared function χ²:
α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period
Confidence level does not depend directly on population size.
λα = χ²(α, ν) / 2T

λ90 compared with λ70:

λ90 = χ²(0.1, 2n + 2) / 2T
| Population | 100 | 100 | 100 | 100 |
| Time in service (hours) | 50,000 | 50,000 | 50,000 | 50,000 |
| Device-hours | 5,000,000 | 5,000,000 | 5,000,000 | 5,000,000 |
| Number of failures | 0 | 1 | 10 | 100 |
| MTBF (hours) | – | 5,000,000 | 500,000 | 50,000 |
| λAVG per hour | – | 2.0E-07 | 2.0E-06 | 2.0E-05 |
| λ50 per hour | 1.4E-07 | 3.4E-07 | 2.1E-06 | 2.0E-05 |
| λ70 per hour | 2.4E-07 | 4.9E-07 | 2.5E-06 | 2.1E-05 |
| λ90 per hour | 4.6E-07 | 7.8E-07 | 3.1E-06 | 2.3E-05 |
| λ90 / λ50 | 3.3 | 2.3 | 1.4 | 1.1 |
| λ90 / λ70 | 1.9 | 1.6 | 1.2 | 1.1 |
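As a sketch of how the bounds in this table can be computed, the snippet below evaluates λ = χ²(confidence, 2n + 2) / 2T. To stay dependency-free it uses the Wilson–Hilferty approximation to the χ² quantile; the function names and illustrative data are assumptions of this sketch, not from the source (a statistics package would normally supply the quantile directly).

```python
from statistics import NormalDist

def chi2_quantile(p: float, nu: int) -> float:
    """Wilson-Hilferty approximation to the chi-squared quantile at probability p."""
    z = NormalDist().inv_cdf(p)
    return nu * (1 - 2 / (9 * nu) + z * (2 / (9 * nu)) ** 0.5) ** 3

def lambda_upper(n_failures: int, device_hours: float, confidence: float) -> float:
    """Upper confidence bound on failure rate: chi2(confidence, 2n + 2) / 2T."""
    nu = 2 * n_failures + 2          # degrees of freedom = 2(n + 1)
    return chi2_quantile(confidence, nu) / (2 * device_hours)

# 100 devices x 50,000 h = 5,000,000 device-hours, 10 failures recorded
T = 5_000_000
print(f"lambda_avg = {10 / T:.1e} per hour")                     # 2.0e-06
print(f"lambda_70  = {lambda_upper(10, T, 0.70):.1e} per hour")  # ~2.5e-06
print(f"lambda_90  = {lambda_upper(10, T, 0.90):.1e} per hour")  # ~3.1e-06
```

The results match the table: with only 10 recorded failures, λ90 is still about 1.5 times λAVG, and the ratio shrinks as more failures are observed.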
Can we be confident?
Users have collected so much data over many years in a variety of environments. With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years. Surely by now we must have recorded enough failures for most commonly applied devices, so that:
λ90 ≈ λ70 ≈ λAVG
Where can we find credible data?
An enormous effort has been made to determine λ. Some widely used sources:
• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook, 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data
Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide: uncertainty intervals span 1 or 2 orders of magnitude.
Wide variation in reported λ
Why is the variation so wide, if λ is constant? OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence the failure rate
λ is NOT constant, so χ² confidence levels do not apply; OREDA provides lower and upper deciles to show the uncertainty.
A long-standing debate
Should the calculated probability of failure take into account systematic failures?
The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.
ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.
The intent of the standards
Manage the risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods
Reduce the risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
‒ Apply diagnostics for fault detection and regular periodic testing
‒ Apply redundant equipment for fault tolerance
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match
If I record 47 heads in 100 tosses, P(head) ≈ 0.47.
If a football team wins 47 matches out of 100, what is the probability they will win their next match?
What is random?
Dictionary: made, done or happening without method or conscious decision; haphazard.
Mathematics: a purely random process involves mutually independent events. The probability of any one event is not dependent on other events.
These definitions are significantly different.
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.
Random failures:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age and wear related failures in the mid-life period)
• Human – operating under stress, non-routine: variable λ
Systematic failures – cannot be quantified by a fixed rate:
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9, catalectic failure: sudden and complete failure.
Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. Markovian approach).
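The 'as good as new' behaviour described in Note 2 is the memoryless property of the exponential law, which is easy to check numerically. The rate value below is illustrative only, not from the source.

```python
import math

lam = 0.01  # assumed constant failure rate, per hour (hypothetical value)

def survival(t: float) -> float:
    """Probability the component is still working at time t (exponential law)."""
    return math.exp(-lam * t)

# Memoryless: having survived s hours, the chance of surviving t more hours
# equals a brand-new component's chance of surviving t hours.
s, t = 500.0, 100.0
conditional = survival(s + t) / survival(s)   # P(T > s+t | T > s)
print(conditional, survival(t))               # both equal e^(-lam*t) ≈ 0.3679
```

This is exactly why a constant-λ model treats an aged component and a new one identically, and why it cannot represent wear-out.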
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5, random hardware failure: failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
This definition is not limited to 'catalectic' failure and cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• so they are treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but with wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed.
PFD(t) = ∫₀ᵗ λDU e^(−λDU τ) dτ = 1 − e^(−λDU t)
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU·t

With a proof test interval T, the average is simply:

PFDavg ≈ λDU·T/2
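These two expressions can be checked numerically. The sketch below uses illustrative values only (λDU = 0.02 pa is the typical valve figure used later in this document); it shows that for λDU·t ≪ 1 the exponential and linear forms agree closely:

```python
import math

def pfd(t, lam_du):
    """Exact PFD for a constant undetected dangerous failure rate."""
    return 1.0 - math.exp(-lam_du * t)

def pfd_avg(lam_du, T):
    """Linear approximation of the average PFD over proof-test interval T."""
    return lam_du * T / 2.0

# Illustrative values only: lam_du = 0.02 per annum, T = 1 year
lam_du, T = 0.02, 1.0
print(pfd(T, lam_du))       # ≈ 0.0198, close to the linear form λDU·t = 0.02
print(pfd_avg(lam_du, T))   # 0.01
```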
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of the final elements.
With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates PFD.
The PFD depends equally on λDU, β and T.
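As a rough numerical sketch of this approximation (the β, λDU and T values below are the typical figures quoted elsewhere in this document, not measured data):

```python
def pfd_avg_1oo2(lam_du, T, beta):
    """PFDavg ≈ (1-β)(λDU·T)²/3 + β·λDU·T/2 for 1oo2 final elements."""
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    return independent + common_cause

# Illustrative: λDU = 0.02 pa, T = 1 y, β = 10%
print(round(pfd_avg_1oo2(0.02, 1.0, 0.10), 5))  # 0.00112, mostly common cause
```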
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.
The failure rate:
• can be measured
• may be predictable
• is not a fixed, constant rate
Typical failure rates are easy to obtain.
Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice.
Typically failure rates vary over at least an order of magnitude, so the scope for reduction in failures is clear:
reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum.
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum.
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum.
These are order-of-magnitude estimates –
but are sufficient to show feasibility of PFD and RRF targets
(risk targets can only be set in orders of magnitude).
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12%–15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on the application:
Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless the overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
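A minimal sketch of this feasibility check, assuming the conventional PFDavg bands for SIL 1-4 and the typical values above (the `sil_band` helper is hypothetical, for illustration only):

```python
def sil_band(pfd_avg):
    """Map a PFDavg to the SIL band it falls in (order of magnitude only)."""
    if pfd_avg >= 0.1:
        return 0          # does not reach SIL 1
    if pfd_avg >= 0.01:
        return 1          # SIL 1 band: 0.01 <= PFDavg < 0.1
    if pfd_avg >= 0.001:
        return 2          # SIL 2 band
    if pfd_avg >= 0.0001:
        return 3          # SIL 3 band
    return 4

# Typical values assumed above: λDU = 0.02 pa, T = 1 y, β = 10%
lam_du, T, beta = 0.02, 1.0, 0.10
pfd_1oo1 = lam_du * T / 2
pfd_1oo2 = (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2
print(sil_band(pfd_1oo1))  # 1 - a 1oo1 valve sits at the top of SIL 1
print(sil_band(pfd_1oo2))  # 2 - 1oo2 reaches SIL 2, limited by common cause
```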
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• calculate actual λDU and compare with the design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
‒ preventing systematic failures
• volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
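As a simple illustration of combining subsystem PFDs: for a series system in which every subsystem must work, the overall PFDavg is approximately the sum of the subsystem values while each is small. The subsystem figures below are assumed for illustration, not taken from any database:

```python
# Approximate overall PFDavg as the sum of subsystem PFDavg values
# (valid while each PFD is small). Illustrative values only.
subsystems = {"sensor": 0.0015, "logic solver": 0.0001, "final element": 0.01}
pfd_total = sum(subsystems.values())
print(round(pfd_total, 4))  # 0.0116 - dominated by the final element
```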
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason, the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
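This behaviour can be sketched numerically. The example below assumes time-censored data, for which the upper confidence bound is λa = χ²a(2n + 2)/(2T); it uses the Wilson-Hilferty approximation to the χ² quantile so that only the Python standard library is needed. All numbers are illustrative:

```python
from statistics import NormalDist

def chi2_ppf(p, k):
    """Wilson-Hilferty approximation to the chi-squared quantile function."""
    z = NormalDist().inv_cdf(p)
    return k * (1 - 2 / (9 * k) + z * (2 / (9 * k)) ** 0.5) ** 3

def lam_upper(confidence, n_failures, t_total):
    """Upper confidence bound on failure rate, given n failures observed
    over t_total cumulative operating time (time-censored data)."""
    return chi2_ppf(confidence, 2 * n_failures + 2) / (2 * t_total)

# With 10 recorded failures the ratio λ90/λ70 is fixed at roughly 1.24,
# regardless of the rate itself or the population size.
t = 500.0  # cumulative device-years, illustrative
print(lam_upper(0.90, 10, t) / lam_upper(0.70, 10, t))  # ≈ 1.24
```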
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector, the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1-out-of-2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours)[8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
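A quick numerical check of this dominance, using the typical values from the example above (`term_ratio` is a hypothetical helper for illustration):

```python
lam_du, T = 0.02, 1.0  # typical values from the example above

def term_ratio(beta):
    """Ratio of the common-cause term to the independent term in PFDavg."""
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    return common_cause / independent

for beta in (0.02, 0.05, 0.10):
    # even at beta = 2% the common cause term already exceeds
    # the independent term; at beta = 10% it is ~8x larger
    print(beta, round(term_ratio(beta), 1))
```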
The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
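A brief sketch of the detected-failure contribution; both input values are assumptions for illustration, not typical figures from this paper:

```python
# Contribution of detected failures: PFD_DD ≈ λDD × MTTR,
# with MTTR converted to the same time units as the rate.
lam_dd = 0.1        # detected dangerous failures per annum (assumed)
mttr_hours = 72.0   # mean time to restoration per repair (assumed)
pfd_dd = lam_dd * (mttr_hours / 8760.0)
print(pfd_dd)  # ≈ 0.0008
```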
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice.
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
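A minimal sketch of such a follow-up check, assuming the expected failure count is simply the design rate multiplied by accumulated device-years; the device count, period and alert threshold are illustrative assumptions, not values from the report:

```python
def expected_failures(lam_design, n_devices, years):
    """Target number of failures implied by the design failure rate."""
    return lam_design * n_devices * years

# Illustrative follow-up check over 5 years of operation
target = expected_failures(0.02, 40, 5.0)  # 4.0 failures expected
observed = 9                               # failures actually recorded (assumed)
if observed > 2 * target:
    # rate is well above the design assumption: analyse causes and
    # consider compensating measures, not just more frequent testing
    print("failure rate exceeds design benchmark - root cause analysis needed")
```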
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates for some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable, and it should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that the plants are designed to minimise initial construction cost The equipment is not designed to facilitate accessibility for testing or for maintenance
For example on LNG compression trains access to the equipment is often constrained Some critical final elements can only be taken out of service at intervals of 5 years or more Even if the designers have provided facilities to enable on-line condition monitoring and testing the opportunities for corrective maintenance are severely constrained by the need to maintain production Deteriorating equipment must remain in service until the next planned shutdown
It is common practice to install dutystandby pairs of pumps and motors where it is critical to maintain production The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose It facilitates on-line testing of sensors It is not common practice to provide dutystandby pairing for safety function final elements but it is possible Dutystandby service can be achieved at by using 2 x 1oo1 or 2oo4 architectures The justification for the additional cost depends on the value of process downtime that can be avoided
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
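The effect of restoration time on risk reduction can be sketched numerically. The formula below is the widely used 1oo1 approximation PFDavg ≈ λDU × (TI/2 + MTTR); the parameter values are assumed for illustration and are not taken from this paper.

```python
# Illustrative sketch only: average PFD of a 1oo1 safety function under the
# common approximation PFDavg ~= lambda_DU * (TI/2 + MTTR).
# All numbers are assumed for illustration, not measured values.

def pfd_avg(lambda_du, test_interval_hr, mttr_hr):
    """Average probability of failure on demand for a 1oo1 function."""
    return lambda_du * (test_interval_hr / 2 + mttr_hr)

prompt_repair = pfd_avg(1e-6, 8760, 8)    # restored within a shift
wait_shutdown = pfd_avg(1e-6, 8760, 720)  # out of service for 30 days
# A 30-day restoration time adds directly to the PFD, which is why
# accessibility for on-line maintenance matters to the achieved risk reduction.
```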
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate; the probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to the management of the 'random' failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude; the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition distinguishing random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources is limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
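One way operators can make that demonstration is to place an upper confidence limit on the failure rate observed in the field. The sketch below assumes a constant-rate (homogeneous Poisson) model, in the spirit of the constant failure rate tests in IEC 60605-6; the function names and the numbers in the example are illustrative, not from this paper.

```python
import math

def poisson_cdf(n, mu):
    """P(X <= n) for a Poisson distribution with mean mu."""
    term = total = math.exp(-mu)
    for k in range(1, n + 1):
        term *= mu / k
        total += term
    return total

def rate_upper_limit(failures, device_hours, confidence=0.90):
    """One-sided upper confidence limit on a constant failure rate.

    Finds, by bisection, the rate at which observing `failures` or fewer
    events in `device_hours` has probability (1 - confidence).
    """
    lo, hi = 0.0, (failures + 10) * 10.0 / device_hours
    for _ in range(200):
        mid = (lo + hi) / 2
        if poisson_cdf(failures, mid * device_hours) > 1 - confidence:
            lo = mid  # this rate is still consistent with the data; bound is higher
        else:
            hi = mid
    return hi

# Example: 5 failures over 1e6 device-hours gives a mean rate of 5e-6 per hour,
# but the 90% upper confidence limit is roughly 9.3e-6 per hour.
limit = rate_upper_limit(5, 1e6, 0.90)
```

The bisection reproduces the familiar chi-squared bound λ_upper = χ²(confidence, 2(n+1)) / (2T) without needing a statistics library.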
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function, and it includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction; SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction; they also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
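This dominance can be illustrated with the standard beta-factor model described in IEC 61508-6. The parameter values below (λDU, test interval, and a beta of 10%) are assumed for illustration only.

```python
# Hedged sketch: PFDavg contributions for a 1oo2 voted pair under the
# beta-factor model. beta is the assumed fraction of dangerous undetected
# failures that are common cause; the numbers are illustrative only.

def pfd_1oo2(lambda_du, ti_hr, beta):
    lam_ind = (1 - beta) * lambda_du             # independent DU failure rate
    independent = (lam_ind * ti_hr) ** 2 / 3     # both channels fail independently
    common_cause = beta * lambda_du * ti_hr / 2  # one shared cause fails both
    return independent, common_cause

ind, ccf = pfd_1oo2(lambda_du=1e-6, ti_hr=8760, beta=0.10)
# Even a modest beta of 10% makes the common cause term exceed the
# independent term by more than an order of magnitude.
```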
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Failure | 41210 | 845334 | 12E-06 | ERRORDIV0 | 85E+05 | ERRORDIV0 | 12E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 41210 | 24362 | 41E-05 | ERRORDIV0 | 24E+04 | ERRORDIV0 | 41E-05 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 41510 | 621767 | 16E-06 | ERRORDIV0 | 62E+05 | ERRORDIV0 | 16E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 41810 | 788953 | 13E-06 | ERRORDIV0 | 79E+05 | ERRORDIV0 | 13E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 41910 | 366532 | 27E-06 | ERRORDIV0 | 37E+05 | ERRORDIV0 | 27E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 42310 | 780944 | 13E-06 | ERRORDIV0 | 78E+05 | ERRORDIV0 | 13E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 42410 | 251133 | 40E-06 | ERRORDIV0 | 25E+05 | ERRORDIV0 | 40E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 42710 | 786268 | 13E-06 | ERRORDIV0 | 79E+05 | ERRORDIV0 | 13E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 42810 | 207976 | 48E-06 | ERRORDIV0 | 21E+05 | ERRORDIV0 | 48E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 5210 | 985398 | 10E-06 | ERRORDIV0 | 99E+05 | ERRORDIV0 | 10E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 5510 | 698150 | 14E-06 | ERRORDIV0 | 70E+05 | ERRORDIV0 | 14E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 5910 | 874164 | 11E-06 | ERRORDIV0 | 87E+05 | ERRORDIV0 | 11E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 51110 | 499082 | 20E-06 | ERRORDIV0 | 50E+05 | ERRORDIV0 | 20E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 51110 | 101503 | 99E-06 | ERRORDIV0 | 10E+05 | ERRORDIV0 | 99E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 51210 | 302989 | 33E-06 | ERRORDIV0 | 30E+05 | ERRORDIV0 | 33E-06 | ERRORDIV0 | ||||||||||||||||||||||
End | 93015 | Average | 536133 | hours | 408E-06 | ERRORDIV0 | 536E+05 | ERRORDIV0 | 408E-06 | ERRORDIV0 | ||||||||||||||||||||
19E-06 | per hour | |||||||||||||||||||||||||||||
Period | 2098 | days | ||||||||||||||||||||||||||||
50352 | hours | |||||||||||||||||||||||||||||
Population | 10000 | devices | ||||||||||||||||||||||||||||
Time in service | 503520000 | Device-hours | ||||||||||||||||||||||||||||
Number of failures | 1000 | 59 | 59 | 59 | 59 | 59 | 59 | |||||||||||||||||||||||
MTBF | 503520 | hours | 57 | years | ||||||||||||||||||||||||||
l50 | 20E-06 | per hour | ||||||||||||||||||||||||||||
Confidence level | ||||||||||||||||||||||||||||||
l90 | 21E-06 | per hour | 90 | 10410605082 | ERRORVALUE | 18 | ERRORDIV0 | ERRORDIV0 | ERRORVALUE | |||||||||||||||||||||
Check mean failure rate using χsup2 | ||||||||||||||||||||||||||||||
l50 | 20E-06 | per hour | 50 | |||||||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | ||||||||||||||||||||||||||||||
MTBF | 536133 | hours | ||||||||||||||||||||||||||||
Standard deviation | 294906 | hours | ||||||||||||||||||||||||||||
13s | 383378 | hours | ||||||||||||||||||||||||||||
MTBF-13s | 152756 | hours | ||||||||||||||||||||||||||||
l90 | 65E-06 | per hour | ||||||||||||||||||||||||||||
Mean | 19E-06 | per hour | ||||||||||||||||||||||||||||
SD | 34E-06 | per hour | ||||||||||||||||||||||||||||
13SD | 44E-06 | per hour | ||||||||||||||||||||||||||||
Mean + 13SD | 63E-06 | per hour | ||||||||||||||||||||||||||||
Population | 100 | 100 | 100 | 100 | 10 | 10 | 10 | 10 | 100 | 100 | 100 | 100 | ||||||||||||||||||
Time in service hours | 50000 | 50000 | 50000 | 50000 | 1000 | 1000 | 1000 | 1000 | 10000 | 10000 | 10000 | 10000 | ||||||||||||||||||
Device-hours | 5000000 | 5000000 | 5000000 | 5000000 | 10000 | 10000 | 10000 | 10000 | 1000000 | 1000000 | 1000000 | 1000000 | ||||||||||||||||||
Number of failures | - 0 | 1 | 10 | 100 | 1 | 10 | 100 | 4 | - 0 | 1 | 10 | 100 | ||||||||||||||||||
MTBF hours | - 0 | 5000000 | 500000 | 50000 | 10000 | 1000 | 100 | 2500 | - 0 | 1000000 | 100000 | 10000 | ||||||||||||||||||
lAVG per hour | - 0 | 20E-07 | 20E-06 | 20E-05 | 10E-04 | 10E-03 | 10E-02 | 40E-04 | - 0 | 10E-06 | 10E-05 | 10E-04 | ||||||||||||||||||
05 | l50 per hour | 14E-07 | 34E-07 | 21E-06 | 20E-05 | 17E-04 | 11E-03 | 10E-02 | 47E-04 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | |||||||||||||||||
07 | l70 per hour | 24E-07 | 49E-07 | 25E-06 | 21E-05 | 24E-04 | 12E-03 | 11E-02 | 59E-04 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | |||||||||||||||||
09 | l90 per hour | 46E-07 | 78E-07 | 31E-06 | 23E-05 | 39E-04 | 15E-03 | 11E-02 | 80E-04 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | |||||||||||||||||
l90 l50 = | 33 | 23 | 14 | 11 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | ||||||||||||||||||||||
l90 l70 = | 19 | 16 | 12 | 11 | 16 | 12 | 11 | 14 | - 0 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ||||||||||||||||||
48784 | 249390 | 2120378 |
[Spreadsheet worked example: a simulated failure log for 1,000 devices observed over 2,098 days (50,352 hours, i.e. 5.0352E+07 device-hours), recording 20 failures, together with a histogram of the times between failures.
Summary of results:
MTBF = 2,517,600 hours
λAVG ≈ 4.0E-07 per hour
Checking the mean failure rate using χ²: λ50 ≈ 4.1E-07 per hour; at 90% confidence, λ90 ≈ 5.4E-07 per hour
Alternatively, using the mean and standard deviation of the times between failures: λ90 ≈ 9.9E-07 per hour (from MTBF − 1.3 SD) or ≈ 1.6E-06 per hour (from mean rate + 1.3 SD).]
Worked example: measuring a failure rate from maintenance records
Observation period: 2,098 days (50,352 hours)
Population: 100 devices
Time in service: 5,035,200 device-hours
Number of failures recorded: 3
MTBF = 5,035,200 / 3 = 1,678,400 hours
λAVG = 3 / 5,035,200 ≈ 6.0E-07 per hour
Checking the mean failure rate using χ²: λ50 ≈ 7.3E-07 per hour
At 90% confidence: λ90 ≈ 1.3E-06 per hour
'Wrong but useful'
Failure probability calculations depend on estimates of equipment failure rates. The basic assumption has been that during mid-life the failure rate is constant and fixed.

This assumption is WRONG:
• The rate is not fixed
• …but the calculations are still useful

We need a different emphasis
We cannot predict probability of failure with precision:
• Failure rates are not fixed and constant
• We can estimate PFD within an order of magnitude only
• We can set feasible targets for failure rates
• We must design the system so that the rates can be achieved

In operation we need to monitor the actual PFD achieved:
• We must measure actual failure rates and analyse all failures
• Manage performance to meet target failure rates
…by preventing preventable failures

Hardware fault tolerance
Calculations of failure probability are not enough, because they depend on very coarse assumptions.
So IEC 61511 also specifies minimum requirements for redundancy, i.e. 'hardware fault tolerance':
'to alleviate potential shortcomings in SIF design that may result due to the number of assumptions made in the design of the SIF, along with uncertainty in the failure rate of components or subsystems used in various process applications'

HFT in IEC 61511 Edition 2
New, simpler requirements for HFT
SIL 3 is allowed with only two block valves (HFT = 1)
• Previously SIL 3 with only two valves needed SFF > 60%, or 'dominant failure to the safe state' (difficult to achieve)

SIL 2 low demand needs only one block valve (HFT = 0)
• Previously two valves were required for SIL 2

Based on the IEC 61508 'Route 2H' method:
• Introduced in 2010
• Requires a confidence level of ≥ 90% in target failure measures (only 70% needed for Route 1H)
e.g. λAVG = number of failures recorded ÷ total time in service

IEC 61511 Ed 2 only requires a 70% confidence level instead of 90%, but, similar to Route 2H, it requires credible, traceable failure rate data based on field feedback from similar devices in a similar operating environment.
HFT in IEC 61511 Edition 2
Why not 90%?

So what is confidence level?
The uncertainty in the rate of independent events depends on how many events are measured. Confidence level relates to the width of the uncertainty band, which depends on the number of failures recorded.

We can evaluate the uncertainty with the chi-squared function χ²:
α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period
Confidence level does not depend directly on population size
λα = χ²(α, ν) / (2T)

λ90 compared with λ70:
λ90 = χ²(0.1, 2n + 2) / (2T)
λ70 = χ²(0.3, 2n + 2) / (2T)
Population:               100        100        100        100
Time in service (hours):  50,000     50,000     50,000     50,000
Device-hours:             5,000,000  5,000,000  5,000,000  5,000,000
Number of failures:       0          1          10         100
MTBF (hours):             —          5,000,000  500,000    50,000

λAVG (per hour):          —          2.0E-07    2.0E-06    2.0E-05
λ50 (per hour):           1.4E-07    3.4E-07    2.1E-06    2.0E-05
λ70 (per hour):           2.4E-07    4.9E-07    2.5E-06    2.1E-05
λ90 (per hour):           4.6E-07    7.8E-07    3.1E-06    2.3E-05

λ90 / λ50 =               3.3        2.3        1.4        1.1
λ90 / λ70 =               1.9        1.6        1.2        1.1
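The table values can be reproduced with a short script. This is a sketch in pure Python (function names are my own); because ν = 2(n + 1) is always even, the χ² CDF has a closed form via the Poisson distribution, so no statistics library is needed:

```python
import math

def chi2_cdf_even(x: float, dof: int) -> float:
    # CDF of chi-squared with EVEN dof 2k: 1 - exp(-x/2) * sum_{i<k} (x/2)^i / i!
    k = dof // 2
    s = sum((x / 2) ** i / math.factorial(i) for i in range(k))
    return 1.0 - math.exp(-x / 2) * s

def chi2_quantile_even(p: float, dof: int) -> float:
    # Invert the CDF by bisection over a generous bracket
    lo, hi = 0.0, dof + 40 * math.sqrt(dof) + 40
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_cdf_even(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def failure_rate_bound(n_failures: int, device_hours: float, confidence: float) -> float:
    """Failure rate (per hour) not exceeded with the given confidence."""
    return chi2_quantile_even(confidence, 2 * (n_failures + 1)) / (2 * device_hours)

# Reproduce the n = 10 column of the table (5,000,000 device-hours)
for conf in (0.5, 0.7, 0.9):
    print(f"lambda at {conf:.0%} confidence = {failure_rate_bound(10, 5e6, conf):.1e} per hour")

# The uncertainty band narrows as more failures are recorded
for n in (0, 1, 10, 100):
    ratio = failure_rate_bound(n, 5e6, 0.9) / failure_rate_bound(n, 5e6, 0.5)
    print(f"n = {n:3d}: lambda90 / lambda50 = {ratio:.1f}")
```

With n = 10 this reproduces the 2.1E-06 / 2.5E-06 / 3.1E-06 column, and the ratio loop shows λ90/λ50 shrinking from about 3.3 at zero failures towards 1.1 at 100 failures, matching the bottom rows of the table.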
Can we be confident?
Users have collected so much data, over many years, in a variety of environments.
With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years.
Surely by now we must have recorded enough failures for most commonly applied devices, so that:
λ90 ≈ λ70 ≈ λAVG
Where can we find credible data?
An enormous effort has been made to determine λ. Some widely used sources:
• OREDA 'Offshore and Onshore Reliability Data' handbooks
• SINTEF PDS Data Handbook, 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data
Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.

Wide variation in reported λ
Why is the variation so wide, if λ is constant?
OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence failure rate
λ is NOT constant, so χ² confidence levels do not apply; OREDA provides lower and upper deciles to show the uncertainty.
A long-standing debate
Should the calculated probability of failure take into account systematic failures?
The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.
ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.

The intent of the standards
Manage risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods
Reduce risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
‒ Apply diagnostics for fault detection and regular periodic testing
‒ Apply redundant equipment for fault tolerance
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match
If I record 47 heads in 100 tosses, P(head) ≈ 0.5
If a football team wins 47 matches out of 100, what is the probability they will win their next match?

What is random?
Dictionary: 'Made, done, or happening without method or conscious decision; haphazard'
Mathematics: A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.
These definitions are significantly different
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Guidance from ISO/TR 12489
Random:
Hardware – electronic components: constant λ
Hardware – mechanical components: non-constant λ (age- and wear-related failures in the mid-life period)
Human – operating under stress, non-routine: variable λ
Systematic (cannot be quantified by a fixed rate):
Hardware – specification, design, installation, operation
Software – specification, coding, testing, modification
Human – depending on training, understanding, attitude

Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9: catalectic failure – sudden and complete failure
Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure – failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure: random hardware failure cannot be characterised by a fixed, constant failure rate – though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High-impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age- or wear-related deterioration (rate not constant)
Most failures are partially systematic and partially random.

Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but with a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed:

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t)
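This exponential accumulation can be checked with a short simulation (a sketch; the rate and time are chosen so that λDU·t = 0.1, matching the 10% example above):

```python
import math
import random

# Monte Carlo check: with failures occurring at a constant random rate,
# the fraction of a population that has failed by time t approaches
# 1 - exp(-lambda_DU * t).

random.seed(1)
lam = 2.0e-6     # dangerous undetected failure rate, per hour
t = 50_000       # hours in service without a proof test (lam * t = 0.1)
n = 100_000      # devices simulated

# Each device's time-to-failure is exponentially distributed
failed = sum(1 for _ in range(n) if random.expovariate(lam) <= t)

print(failed / n)                 # simulated fraction failed
print(1 - math.exp(-lam * t))     # analytic value, about 0.095
```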
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:
PFD ≈ λDU·t
With a proof test interval T, the average is simply:
PFDavg ≈ λDU·T / 2
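The linear approximation can be compared against the exact average of the exponential expression (a sketch, using the typical valve failure rate quoted later in the presentation):

```python
import math

def pfd_avg_exact(lam, T):
    # Time-average of PFD(t) = 1 - exp(-lam * t) over a proof-test interval T
    return 1.0 - (1.0 - math.exp(-lam * T)) / (lam * T)

def pfd_avg_linear(lam, T):
    # PFDavg ~= lambda_DU * T / 2, valid while lambda_DU * T << 1
    return lam * T / 2

lam = 0.02   # per annum (a typical actuated valve assembly)
T = 1.0      # proof-test interval, years

print(pfd_avg_linear(lam, T))   # 0.01
print(pfd_avg_exact(lam, T))    # about 0.00993: the linear form is within 1%
```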
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements. With 1oo2 valves the PFD is approximately:

PFDavg ≈ ((1 − β)·λDU·T)² / 3 + β·λDU·T / 2

β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates PFD.
The PFD depends equally on λDU, β and T.
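The dominance of the common-cause term can be seen by splitting the formula into its two terms (a sketch; λDU and T follow the typical values quoted in the presentation):

```python
# Split the 1oo2 PFDavg formula into its independent and common-cause terms
# to show which one dominates for realistic beta.

def pfd_1oo2_terms(lam_du, T, beta):
    independent = ((1 - beta) * lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    return independent, common_cause

lam_du, T = 0.02, 1.0   # per annum / years
for beta in (0.01, 0.05, 0.10, 0.15):
    ind, ccf = pfd_1oo2_terms(lam_du, T, beta)
    print(f"beta = {beta:.0%}: independent = {ind:.1e}, common cause = {ccf:.1e}")
```

For β at the practical levels cited later (5% and above), the common-cause term is several times larger than the independent term, which is why β dominates the PFD of voted architectures.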
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically, failure rates vary over at least an order of magnitude, so the scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but they are sufficient to show feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12%–15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application
Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ Low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
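These architecture claims can be checked numerically (a sketch; the thresholds are the standard low-demand bands, PFDavg < 1E-02 for SIL 2 and < 1E-03 for SIL 3, and the reduced parameter values in the last line are illustrative assumptions):

```python
# Check the typical-architecture claims with beta = 10%, lambda_DU = 0.02 pa, T = 1 y.

def pfd_1oo1(lam_du, T):
    return lam_du * T / 2

def pfd_1oo2(lam_du, T, beta):
    return ((1 - beta) * lam_du * T) ** 2 / 3 + beta * lam_du * T / 2

# Single valve: PFDavg = 0.01, right at the SIL 1 / SIL 2 boundary
print(pfd_1oo1(0.02, 1.0))

# 1oo2 with typical values: about 1.1e-3 -- comfortably SIL 2, but not SIL 3
print(pfd_1oo2(0.02, 1.0, 0.10))

# SIL 3 needs lambda_DU, beta and/or T reduced; e.g. halving lambda_DU and beta
print(pfd_1oo2(0.01, 1.0, 0.05))   # about 2.8e-4, inside the SIL 3 band
```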
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design, and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate actual λDU and compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates It draws conclusions on how failure rates and failure probability can be controlled in practice
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1 Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices

2 Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3 Using risk-based inspection and condition-based maintenance techniques to
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4 Detailed analysis of all failures to
- Monitor failure rates
- Determine how failure recurrence may be prevented if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways.
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
bull Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
bull Failures occurring within a population of devices are assumed to be independent events
ISOTR 12489 uses the term lsquocatalectic failurersquo to clarify the types of failures that can be expected to occur at fixed rates
329 catalectic failure sudden and complete failure
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 3
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
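As an illustration of these assumptions, the exponential model and the rare-event combination of component PFDs can be sketched in a few lines of Python. This is only a sketch: the function names and example rates are illustrative, not taken from any standard.

```python
import math

def pfd_at_time(lam_du, t):
    # Probability that a single component is in a dangerous undetected
    # failed state at time t, given a fixed constant rate (exponential law)
    return 1.0 - math.exp(-lam_du * t)

def pfd_avg_1oo1(lam_du, test_interval):
    # Average PFD over a proof-test interval T; for small lam_du * T this
    # approaches the familiar lam_du * T / 2 approximation
    x = lam_du * test_interval
    return 1.0 - (1.0 - math.exp(-x)) / x

# Illustrative subsystem rates (failures per annum), proof-tested yearly:
sensor, logic, valve = 0.002, 0.0005, 0.02
# Rare-event approximation: PFDs of independent subsystems simply add
pfd_total = sum(pfd_avg_1oo1(lam, 1.0) for lam in (sensor, logic, valve))
```

With these illustrative numbers the valve contributes roughly 90% of the total, consistent with the observation later in this paper that final elements usually dominate the PFD.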
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the confidence is further increased: λ90 will be no more than about 20% higher than λ70.
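The one-sided χ² bound can be sketched as follows. This is my own illustration, not a method given in the paper: the helper names are assumptions, and the Wilson-Hilferty approximation to the χ² quantile is used only to avoid a statistics-library dependency. For a time-truncated observation of n failures over cumulative time τ, the upper bound at confidence a is λa = χ²a(2n + 2) / 2τ.

```python
from statistics import NormalDist

def chi2_quantile(p, k):
    # Wilson-Hilferty approximation to the chi-squared quantile with
    # k degrees of freedom; adequate for k >= ~5
    z = NormalDist().inv_cdf(p)
    return k * (1.0 - 2.0 / (9.0 * k) + z * (2.0 / (9.0 * k)) ** 0.5) ** 3

def lambda_upper(n_failures, cumulative_hours, confidence):
    # Upper confidence bound on a constant failure rate from n observed
    # failures over a cumulative time in service (time-truncated test)
    return chi2_quantile(confidence, 2 * n_failures + 2) / (2.0 * cumulative_hours)

# The ratio lambda_90 / lambda_70 depends only on the failure count:
ratio_3 = lambda_upper(3, 1e6, 0.90) / lambda_upper(3, 1e6, 0.70)    # ~1.4
ratio_10 = lambda_upper(10, 1e6, 0.90) / lambda_upper(10, 1e6, 0.70)  # ~1.2
```

Note how the ratio shrinks as failures accumulate, which is the narrowing of the uncertainty band described above.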
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F H C Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process: degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)(λDU T)²/3 + β λDU T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
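The dominance of the common cause term is easy to check numerically. The following sketch simply evaluates the two terms of the approximation above; the function name is an illustrative assumption.

```python
def pfd_1oo2_terms(lam_du, test_interval, beta):
    # Two terms of the 1oo2 approximation: coincident independent
    # failures, and the common cause contribution
    independent = (1.0 - beta) * (lam_du * test_interval) ** 2 / 3.0
    common_cause = beta * lam_du * test_interval / 2.0
    return independent, common_cause

ind, ccf = pfd_1oo2_terms(lam_du=0.02, test_interval=1.0, beta=0.10)
# With beta = 10%, the common cause term (1.0e-3) is roughly eight
# times the independent term (1.2e-4), so it dominates the PFD
```

At the 12% to 15% β values reported by SINTEF the imbalance is even larger, which is why β is the first parameter to attack in design.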
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
Wrong but useful
The SINTEF Report A26922 includes a pertinent quote from George E P Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion of the different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant: the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
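The order-of-magnitude nature of the estimate fits naturally with the SIL bands. A minimal sketch of the low-demand-mode categorisation (bands as defined in IEC 61508 / IEC 61511; the function name is illustrative):

```python
def sil_band(pfd_avg):
    # Low-demand-mode SIL bands: SIL n corresponds to
    # 10^-(n+1) <= PFD < 10^-n, i.e. RRF = 1 / PFD of 10^n to 10^(n+1)
    for sil, lower in ((4, 1e-5), (3, 1e-4), (2, 1e-3), (1, 1e-2)):
        if lower <= pfd_avg < lower * 10:
            return sil
    return None  # outside the SIL 1 to SIL 4 bands

# A calculated PFD of 1.1e-3 (RRF of roughly 900) falls in the SIL 2 band
```

Because the band edges are a factor of 10 apart, one significant figure of PFD precision is exactly enough to place a function in a band.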
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
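One way to implement such a target check is a simple one-sided Poisson test against the expected failure count. This sketch is my own illustration, not a method prescribed by the SINTEF report; the function names, fleet size and significance threshold are all assumptions.

```python
import math

def poisson_sf(k, mu):
    # P(N >= k) for a Poisson-distributed count with mean mu
    return 1.0 - sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))

def failures_exceed_target(observed, lam_target, n_devices, years, alpha=0.05):
    # Flag when the observed failure count is implausibly high for the
    # failure rate assumed in the design (one-sided Poisson test)
    mu = lam_target * n_devices * years
    return poisson_sf(observed, mu) < alpha

# e.g. 50 valves at a target 0.02 failures/yr give mu = 3 over 3 years;
# seeing 7 or more failures would trigger investigation of the causes
```

A flag from such a test is the trigger for root cause analysis and compensating measures, not a substitute for them.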
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1 The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2 The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3 Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4 In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S Hauge and M A Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S Hauge et al, 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol 1, 6th ed, SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed, exida.com LLC, Sellersville, PA, 2015
[13] F H C Marriott, 'A Dictionary of Statistical Terms', 5th ed, International Statistical Institute, Longman Scientific and Technical, 1990
-17634279936 | 17634279936 | ||||||||||||||||||||||||
-17708520116 | 17708520116 | ||||||||||||||||||||||||
Average | - 135834 | ||||||||||||||||||||||||
Period | - 0 | days | |||||||||||||||||||||||
- 0 | hours | ||||||||||||||||||||||||
Population | 1000 | devices | |||||||||||||||||||||||
Time in service | - 0 | Device-hours | |||||||||||||||||||||||
Number of failures | - 0 | ||||||||||||||||||||||||
MTBF | ERRORDIV0 | hours | ERRORDIV0 | years | |||||||||||||||||||||
l50 | ERRORDIV0 | per hour | |||||||||||||||||||||||
Confidence level | |||||||||||||||||||||||||
l90 | ERRORDIV0 | per hour | 90 | ||||||||||||||||||||||
Check mean failure rate using χsup2 | |||||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | 50 | ||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | |||||||||||||||||||||||||
MTBF | - 1 | hours | |||||||||||||||||||||||
Standard deviation | 0 | hours | |||||||||||||||||||||||
13s | 1 | hours | |||||||||||||||||||||||
MTBF-13s | - 2 | hours | |||||||||||||||||||||||
l90 | -54E-01 | per hour | |||||||||||||||||||||||
Mean | -74E-01 | per hour | |||||||||||||||||||||||
SD | 26E+00 | per hour | |||||||||||||||||||||||
13SD | 33E+00 | per hour | |||||||||||||||||||||||
Mean + 13SD | 26E+00 | per hour | |||||||||||||||||||||||
Min | Max | 30 | steps | step size | bottom bin | mean | Stddev | ||||||||||||||||||
- 0 | 200 | 200 | 007 | 007 | $000 | 1 | 0 | ||||||||||||||||||
0 | assume min-max is 5 std dev | ||||||||||||||||||||||||
03010299957 | $000 | ||||||||||||||||||||||||
04771212547 | Bin Bottom | Bin Label | No of Failures | Normalised | |||||||||||||||||||||
06020599913 | 0 | - 0 | 1 | 0003124312 | |||||||||||||||||||||
06989700043 | 0067 | 003400 | 0 | 00041547353 | |||||||||||||||||||||
07781512504 | 0134 | 010100 | 0 | 00071333981 | |||||||||||||||||||||
084509804 | 0201 | 016800 | 0 | 00119087147 | |||||||||||||||||||||
0903089987 | 0268 | 023500 | 0 | 00193307472 | |||||||||||||||||||||
09542425094 | 0335 | 030200 | 1 | 00305103872 | |||||||||||||||||||||
1 | 0402 | 036900 | 0 | 00468233112 | |||||||||||||||||||||
10413926852 | 0469 | 043600 | 0 | 00698701782 | |||||||||||||||||||||
1079181246 | 0536 | 050300 | 1 | 01013764093 | |||||||||||||||||||||
11139433523 | 0603 | 057000 | 1 | 01430201681 | |||||||||||||||||||||
11461280357 | 067 | 063700 | 0 | 01961882481 | |||||||||||||||||||||
11760912591 | 0737 | 070400 | 1 | 02616760765 | |||||||||||||||||||||
12041199827 | 0804 | 077100 | 1 | 03393675985 | |||||||||||||||||||||
12304489214 | 0871 | 083800 | 1 | 04279490399 | |||||||||||||||||||||
12552725051 | 0938 | 090500 | 1 | 0524721746 | |||||||||||||||||||||
1278753601 | 1005 | 097200 | 2 | 0625577896 | |||||||||||||||||||||
13010299957 | 1072 | 103900 | 1 | 07251854011 | |||||||||||||||||||||
13222192947 | 1139 | 110600 | 2 | 08173951106 | |||||||||||||||||||||
13424226808 | 1206 | 117300 | 3 | 08958397809 | |||||||||||||||||||||
1361727836 | 1273 | 124000 | 2 | 09546495624 | |||||||||||||||||||||
13802112417 | 134 | 130700 | 3 | 09891745568 | |||||||||||||||||||||
13979400087 | 1407 | 137400 | 4 | 09965915989 | |||||||||||||||||||||
1414973348 | 1474 | 144100 | 4 | 09762854839 | |||||||||||||||||||||
14313637642 | 1541 | 150800 | 5 | 09299332313 | |||||||||||||||||||||
14471580313 | 1608 | 157500 | 6 | 08612753716 | |||||||||||||||||||||
14623979979 | 1675 | 164200 | 7 | 07756175282 | |||||||||||||||||||||
14771212547 | 1742 | 170900 | 8 | 06791544131 | |||||||||||||||||||||
14913616938 | |||||||||||||||||||||||||
15051499783 | |||||||||||||||||||||||||
15185139399 | |||||||||||||||||||||||||
1531478917 | |||||||||||||||||||||||||
15440680444 | |||||||||||||||||||||||||
15563025008 | |||||||||||||||||||||||||
15682017241 | |||||||||||||||||||||||||
15797835966 | |||||||||||||||||||||||||
1591064607 | |||||||||||||||||||||||||
16020599913 | |||||||||||||||||||||||||
16127838567 | |||||||||||||||||||||||||
16232492904 | |||||||||||||||||||||||||
16334684556 | |||||||||||||||||||||||||
16434526765 | |||||||||||||||||||||||||
16532125138 | |||||||||||||||||||||||||
16627578317 | |||||||||||||||||||||||||
16720978579 | |||||||||||||||||||||||||
16812412374 | |||||||||||||||||||||||||
169019608 | |||||||||||||||||||||||||
16989700043 | |||||||||||||||||||||||||
17075701761 | |||||||||||||||||||||||||
17160033436 | |||||||||||||||||||||||||
17242758696 | |||||||||||||||||||||||||
17323937598 | |||||||||||||||||||||||||
17403626895 | |||||||||||||||||||||||||
1748188027 | |||||||||||||||||||||||||
17558748557 | |||||||||||||||||||||||||
17634279936 | |||||||||||||||||||||||||
17708520116 | |||||||||||||||||||||||||
Average | 135834 | ||||||||||||||||||||||||
Period | - 0 | days | |||||||||||||||||||||||
- 0 | hours | ||||||||||||||||||||||||
Population | 1000 | devices | |||||||||||||||||||||||
Time in service | - 0 | Device-hours | |||||||||||||||||||||||
Number of failures | - 0 | ||||||||||||||||||||||||
MTBF | ERRORDIV0 | hours | ERRORDIV0 | years | |||||||||||||||||||||
l50 | ERRORDIV0 | per hour | |||||||||||||||||||||||
Confidence level | |||||||||||||||||||||||||
l90 | ERRORDIV0 | per hour | 90 | ||||||||||||||||||||||
Check mean failure rate using χsup2 | |||||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | 50 | ||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | |||||||||||||||||||||||||
MTBF | 1 | hours | |||||||||||||||||||||||
Standard deviation | 0 | hours | |||||||||||||||||||||||
13s | 1 | hours | |||||||||||||||||||||||
MTBF-13s | 1 | hours | |||||||||||||||||||||||
l90 | 12E+00 | per hour | |||||||||||||||||||||||
Mean | 74E-01 | per hour | |||||||||||||||||||||||
SD | 26E+00 | per hour | |||||||||||||||||||||||
13SD | 33E+00 | per hour | |||||||||||||||||||||||
Mean + 13SD | 41E+00 | per hour | |||||||||||||||||||||||
Min | Max | 30 | steps | step size | bottom bin | mean | Stddev | ||||||||||||||||||
- 0 | 100 | 100 | 003 | 003 | $000 | 0 | 0 | ||||||||||||||||||
1 | 10E+00 | assume min-max is 5 std dev | |||||||||||||||||||||||
05 | 20E+00 | $000 | |||||||||||||||||||||||
03333333333 | 30E+00 | Bin Bottom | Bin Label | No of Failures | Normalised | ||||||||||||||||||||
025 | 40E+00 | 0 | - 0 | 0 | 18448778924 | ||||||||||||||||||||
02 | 50E+00 | 00335 | 001700 | 30 | 19010216777 | ||||||||||||||||||||
01666666667 | 60E+00 | 0067 | 005000 | 15 | 19737983481 | ||||||||||||||||||||
01428571429 | 70E+00 | 01005 | 008400 | 5 | 19940974276 | ||||||||||||||||||||
0125 | 80E+00 | 0134 | 011700 | 2 | 19590993373 | ||||||||||||||||||||
01111111111 | 90E+00 | 01675 | 015100 | 2 | 1869678694 | ||||||||||||||||||||
01 | 10E+01 | 0201 | 018400 | 1 | 17380866988 | ||||||||||||||||||||
00909090909 | 11E+01 | 02345 | 021800 | 0 | 15669274423 | ||||||||||||||||||||
00833333333 | 12E+01 | 0268 | 025100 | 1 | 13783125675 | ||||||||||||||||||||
00769230769 | 13E+01 | 03015 | 028500 | 0 | 11737945695 | ||||||||||||||||||||
00714285714 | 14E+01 | 0335 | 031800 | 1 | 09769791557 | ||||||||||||||||||||
00666666667 | 15E+01 | 03685 | 035200 | 0 | 07859530562 | ||||||||||||||||||||
00625 | 16E+01 | 0402 | 038500 | 0 | 0618990782 | ||||||||||||||||||||
00588235294 | 17E+01 | 04355 | 041900 | 0 | 04703947001 | ||||||||||||||||||||
00555555556 | 18E+01 | 0469 | 045200 | 0 | 03505454762 | ||||||||||||||||||||
00526315789 | 19E+01 | 05025 | 048600 | 1 | 0251645712 | ||||||||||||||||||||
005 | 20E+01 | 0536 | 051900 | 0 | 0177445853 | ||||||||||||||||||||
00476190476 | 21E+01 | 05695 | 055300 | 0 | 01203311178 | ||||||||||||||||||||
00454545455 | 22E+01 | 0603 | 058600 | 0 | 00802876309 | ||||||||||||||||||||
00434782609 | 23E+01 | 06365 | 062000 | 0 | 00514313198 | ||||||||||||||||||||
00416666667 | 24E+01 | 067 | 065300 | 0 | 00324707809 | ||||||||||||||||||||
004 | 25E+01 | 07035 | 068700 | 0 | 00196489202 | ||||||||||||||||||||
00384615385 | 26E+01 | 0737 | 072000 | 0 | 00117381087 | ||||||||||||||||||||
0037037037 | 27E+01 | 07705 | 075400 | 0 | 00067098222 | ||||||||||||||||||||
00357142857 | 28E+01 | 0804 | 078700 | 0 | 00037928426 | ||||||||||||||||||||
00344827586 | 29E+01 | 08375 | 082100 | 0 | 00020480692 | ||||||||||||||||||||
00333333333 | 30E+01 | 0871 | 085400 | 0 | 00010954506 | ||||||||||||||||||||
00322580645 | 31E+01 | ||||||||||||||||||||||||
003125 | 32E+01 | ||||||||||||||||||||||||
00303030303 | 33E+01 | ||||||||||||||||||||||||
00294117647 | 34E+01 | ||||||||||||||||||||||||
00285714286 | 35E+01 | ||||||||||||||||||||||||
00277777778 | 36E+01 | ||||||||||||||||||||||||
0027027027 | 37E+01 | ||||||||||||||||||||||||
00263157895 | 38E+01 | ||||||||||||||||||||||||
00256410256 | 39E+01 | ||||||||||||||||||||||||
0025 | 40E+01 | ||||||||||||||||||||||||
00243902439 | 41E+01 | ||||||||||||||||||||||||
00238095238 | 42E+01 | ||||||||||||||||||||||||
0023255814 | 43E+01 | ||||||||||||||||||||||||
00227272727 | 44E+01 | ||||||||||||||||||||||||
00222222222 | 45E+01 | ||||||||||||||||||||||||
00217391304 | 46E+01 | ||||||||||||||||||||||||
00212765957 | 47E+01 | ||||||||||||||||||||||||
00208333333 | 48E+01 | ||||||||||||||||||||||||
00204081633 | 49E+01 | ||||||||||||||||||||||||
002 | 50E+01 | ||||||||||||||||||||||||
00196078431 | 51E+01 | ||||||||||||||||||||||||
00192307692 | 52E+01 | ||||||||||||||||||||||||
00188679245 | 53E+01 | ||||||||||||||||||||||||
00185185185 | 54E+01 | ||||||||||||||||||||||||
00181818182 | 55E+01 | ||||||||||||||||||||||||
00178571429 | 56E+01 | ||||||||||||||||||||||||
00175438596 | 57E+01 | ||||||||||||||||||||||||
00172413793 | 58E+01 | ||||||||||||||||||||||||
00169491525 | 59E+01 | ||||||||||||||||||||||||
Average | 007904 | hours | 300E+01 | ||||||||||||||||||||||
33E-02 | |||||||||||||||||||||||||
Period | - 0 | days | |||||||||||||||||||||||
- 0 | hours | ||||||||||||||||||||||||
Population | 1000 | devices | |||||||||||||||||||||||
Time in service | - 0 | Device-hours | |||||||||||||||||||||||
Number of failures | - 0 | ||||||||||||||||||||||||
MTBF | ERRORDIV0 | hours | ERRORDIV0 | years | |||||||||||||||||||||
l50 | ERRORDIV0 | per hour | |||||||||||||||||||||||
Confidence level | |||||||||||||||||||||||||
l90 | ERRORDIV0 | per hour | 90 | ||||||||||||||||||||||
Check mean failure rate using χsup2 | |||||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | 50 | ||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | |||||||||||||||||||||||||
MTBF | 0 | hours | |||||||||||||||||||||||
Standard deviation | 0 | hours | |||||||||||||||||||||||
13s | 0 | hours | |||||||||||||||||||||||
MTBF-13s | - 0 | hours | |||||||||||||||||||||||
l90 | -89E+00 | per hour | |||||||||||||||||||||||
Mean | 13E+01 | per hour | |||||||||||||||||||||||
SD | 68E+00 | per hour | |||||||||||||||||||||||
13SD | 88E+00 | per hour | |||||||||||||||||||||||
Mean + 13SD | 21E+01 | per hour | |||||||||||||||||||||||
Min | Max | 30 | steps | step size | bottom bin | mean | Stddev | ||||||||||||||||||||
Number | inverse | - 0 | 6000 | 6000 | 200 | 200 | $000 | 30 | 12 | ||||||||||||||||||
1 | 10E+00 | assume min-max is 5 std dev | |||||||||||||||||||||||||
2 | 50E-01 | $000 | |||||||||||||||||||||||||
3 | 33E-01 | Bin Bottom | Bin Label | No of Failures | Normalised | ||||||||||||||||||||||
4 | 25E-01 | - 0 | - 0 | 0 | 00014606917 | ||||||||||||||||||||||
5 | 20E-01 | 2 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
6 | 17E-01 | 4 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
7 | 14E-01 | 6 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
8 | 13E-01 | 8 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
9 | 11E-01 | 10 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
10 | 10E-01 | 12 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
11 | 91E-02 | 14 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
12 | 83E-02 | 16 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
13 | 77E-02 | 18 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
14 | 71E-02 | 20 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
15 | 67E-02 | 22 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
16 | 63E-02 | 24 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
17 | 59E-02 | 26 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
18 | 56E-02 | 28 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
19 | 53E-02 | 30 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
20 | 50E-02 | 32 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
21 | 48E-02 | 34 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
22 | 45E-02 | 36 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
23 | 43E-02 | 38 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
24 | 42E-02 | 40 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
25 | 40E-02 | 42 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
26 | 38E-02 | 44 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
27 | 37E-02 | 46 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
28 | 36E-02 | 48 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
29 | 34E-02 | 50 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
30 | 33E-02 | 52 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
31 | 32E-02 | ||||||||||||||||||||||||||
32 | 31E-02 | ||||||||||||||||||||||||||
33 | 30E-02 | ||||||||||||||||||||||||||
34 | 29E-02 | ||||||||||||||||||||||||||
35 | 29E-02 | ||||||||||||||||||||||||||
36 | 28E-02 | ||||||||||||||||||||||||||
37 | 27E-02 | ||||||||||||||||||||||||||
38 | 26E-02 | ||||||||||||||||||||||||||
39 | 26E-02 | ||||||||||||||||||||||||||
40 | 25E-02 | ||||||||||||||||||||||||||
41 | 24E-02 | ||||||||||||||||||||||||||
42 | 24E-02 | ||||||||||||||||||||||||||
43 | 23E-02 | ||||||||||||||||||||||||||
44 | 23E-02 | ||||||||||||||||||||||||||
45 | 22E-02 | ||||||||||||||||||||||||||
46 | 22E-02 | ||||||||||||||||||||||||||
47 | 21E-02 | ||||||||||||||||||||||||||
48 | 21E-02 | ||||||||||||||||||||||||||
49 | 20E-02 | ||||||||||||||||||||||||||
50 | 20E-02 | ||||||||||||||||||||||||||
51 | 20E-02 | ||||||||||||||||||||||||||
52 | 19E-02 | ||||||||||||||||||||||||||
53 | 19E-02 | ||||||||||||||||||||||||||
54 | 19E-02 | ||||||||||||||||||||||||||
55 | 18E-02 | ||||||||||||||||||||||||||
56 | 18E-02 | ||||||||||||||||||||||||||
57 | 18E-02 | ||||||||||||||||||||||||||
58 | 17E-02 | ||||||||||||||||||||||||||
59 | 17E-02 | ||||||||||||||||||||||||||
Average | 30 | hours | 790E-02 | ||||||||||||||||||||||||
33E-02 | |||||||||||||||||||||||||||
Period | - 0 | days | |||||||||||||||||||||||||
- 0 | hours | ||||||||||||||||||||||||||
Population | 1000 | devices | |||||||||||||||||||||||||
Time in service | - 0 | Device-hours | |||||||||||||||||||||||||
Number of failures | - 0 | ||||||||||||||||||||||||||
MTBF | ERRORDIV0 | hours | ERRORDIV0 | years | |||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | |||||||||||||||||||||||||
Confidence level | |||||||||||||||||||||||||||
l90 | ERRORDIV0 | per hour | 90 | ||||||||||||||||||||||||
Check mean failure rate using χsup2 | |||||||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | 50 | ||||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | |||||||||||||||||||||||||||
MTBF | 30 | hours | |||||||||||||||||||||||||
Standard deviation | 17 | hours | |||||||||||||||||||||||||
13s | 22 | hours | |||||||||||||||||||||||||
MTBF-13s | 8 | hours | |||||||||||||||||||||||||
l90 | 13E-01 | per hour | |||||||||||||||||||||||||
Mean | 33E-02 | per hour | |||||||||||||||||||||||||
SD | 58E-02 | per hour | |||||||||||||||||||||||||
13SD | 76E-02 | per hour | |||||||||||||||||||||||||
Mean + 13SD | 11E-01 | per hour | |||||||||||||||||||||||||
[Spreadsheet residue condensed. Worked example: 10,000 devices in service for 2,098 days (50,352 hours), i.e. 503,520,000 device-hours, with 1,000 recorded failures. MTBF = 503,520 hours (about 57 years); λ50 ≈ 2.0E-06 per hour and λ90 ≈ 2.1E-06 per hour (χ² method). With this many failures the 90% confidence bound is barely above the mean estimate.]
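The arithmetic of the worked example above can be sketched as follows (a minimal sketch; the device count, service hours and failure total are taken from the example, and the function name is illustrative):

```python
def point_estimate(population, hours_in_service, failures):
    """Point (mean) estimate of MTBF and failure rate from field data."""
    device_hours = population * hours_in_service
    mtbf = device_hours / failures           # hours
    failure_rate = failures / device_hours   # per device-hour
    return mtbf, failure_rate

# 10,000 devices observed for 50,352 hours with 1,000 failures
mtbf, lam = point_estimate(10_000, 50_352, 1_000)
print(f"MTBF = {mtbf:,.0f} h (~{mtbf / 8760:.0f} years), lambda = {lam:.1e}/h")
```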
Confidence bounds on estimated failure rates (χ² method) for different numbers of recorded failures:

Population (devices)    |     100 |     100 |     100 |     100 |      10 |      10 |      10 |      10
Time in service (hours) |  50,000 |  50,000 |  50,000 |  50,000 |   1,000 |   1,000 |   1,000 |   1,000
Device-hours            | 5.0E+06 | 5.0E+06 | 5.0E+06 | 5.0E+06 | 1.0E+04 | 1.0E+04 | 1.0E+04 | 1.0E+04
Number of failures      |       0 |       1 |      10 |     100 |       1 |      10 |     100 |       4
MTBF (hours)            |       – | 5.0E+06 | 5.0E+05 | 5.0E+04 | 1.0E+04 | 1.0E+03 | 1.0E+02 | 2.5E+03
λavg (per hour)         |       – | 2.0E-07 | 2.0E-06 | 2.0E-05 | 1.0E-04 | 1.0E-03 | 1.0E-02 | 4.0E-04
λ50 (per hour)          | 1.4E-07 | 3.4E-07 | 2.1E-06 | 2.0E-05 | 1.7E-04 | 1.1E-03 | 1.0E-02 | 4.7E-04
λ70 (per hour)          | 2.4E-07 | 4.9E-07 | 2.5E-06 | 2.1E-05 | 2.4E-04 | 1.2E-03 | 1.1E-02 | 5.9E-04
λ90 (per hour)          | 4.6E-07 | 7.8E-07 | 3.1E-06 | 2.3E-05 | 3.9E-04 | 1.5E-03 | 1.1E-02 | 8.0E-04
λ90 / λ50               |     3.3 |     2.3 |     1.4 |     1.1 |     2.3 |     1.4 |     1.1 |     1.7
λ90 / λ70               |     1.9 |     1.6 |     1.2 |     1.1 |     1.6 |     1.2 |     1.1 |     1.4

The ratio λ90/λ50 depends only on the number of failures observed: with no recorded failures the 90% confidence bound is 3.3 times the median estimate, falling to within about 10% of it once 100 failures have been recorded.
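The confidence columns above follow from the standard χ² method: the failure rate not exceeded with confidence C, given n failures in T device-hours, is λ_C = χ²(C; 2n+2) / 2T. A stdlib-only sketch using the Wilson–Hilferty approximation for the χ² quantile (function names are illustrative; `scipy.stats.chi2.ppf` gives the exact quantile if SciPy is available):

```python
from statistics import NormalDist

def chi2_quantile(p, k):
    """Wilson-Hilferty approximation to the chi-squared p-quantile with k degrees of freedom."""
    z = NormalDist().inv_cdf(p)
    return k * (1 - 2 / (9 * k) + z * (2 / (9 * k)) ** 0.5) ** 3

def failure_rate_bound(failures, device_hours, confidence):
    """Failure rate not exceeded with the given confidence (chi-squared method)."""
    dof = 2 * failures + 2
    return chi2_quantile(confidence, dof) / (2 * device_hours)

# 1 failure in 5,000,000 device-hours (second column of the table)
l50 = failure_rate_bound(1, 5_000_000, 0.50)   # ~3.4E-07 per hour
l90 = failure_rate_bound(1, 5_000_000, 0.90)   # ~7.8E-07 per hour
```

The approximation is within about 1% of the exact χ² quantile for the degrees of freedom used here, which is ample given that the whole point is order-of-magnitude estimation.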
[Spreadsheet residue condensed. Worked example: 1,000 devices in service for 2,098 days (50,352 hours), i.e. 50,352,000 device-hours, with 20 recorded failures. MTBF = 2,517,600 hours; λ50 ≈ 4.0E-07 per hour; λ90 ≈ 5.4E-07 per hour (χ² method).]
[Spreadsheet example: failure rate estimated from a log of 3 failures in 100 devices]
Period: 2,098 days (50,352 hours); time in service: 5,035,200 device-hours; number of failures: 3
MTBF = 1,678,400 hours; λ50 ≈ 6.0E-07 per hour (7.3E-07 per hour checked using χ²); λ90 ≈ 1.3E-06 per hour (90% confidence, χ²)
We need a different emphasis
We cannot predict probability of failure with precision:
• Failure rates are not fixed and constant
• We can estimate PFD within an order of magnitude only
• We can set feasible targets for failure rates
• We must design the system so that the rates can be achieved
In operation we need to monitor the actual PFD achieved:
• We must measure actual failure rates and analyse all failures
• Manage performance to meet target failure rates by preventing preventable failures
Hardware fault tolerance
Calculations of failure probability are not enough, because they depend on very coarse assumptions.
So IEC 61511 also specifies minimum requirements for redundancy, i.e. 'hardware fault tolerance':
'to alleviate potential shortcomings in SIF design that may result due to the number of assumptions made in the design of the SIF, along with uncertainty in the failure rate of components or subsystems used in various process applications'
HFT in IEC 61511 Edition 2
New, simpler requirements for HFT:
SIL 3 is allowed with only two block valves (HFT = 1)
• Previously SIL 3 with only two valves needed SFF > 60%, or 'dominant failure to the safe state' (difficult to achieve)
SIL 2 low demand needs only one block valve (HFT = 0)
• Previously two valves were required for SIL 2
Based on the IEC 61508 'Route 2H' method:
• Introduced in 2010
• Requires a confidence level of ≥ 90% in target failure measures (only 70% is needed for Route 1H), e.g. λAVG = number of failures recorded ÷ total time in service
IEC 61511 Ed 2 requires only a 70% confidence level instead of 90%, but similar to Route 2H it requires credible, traceable failure rate data based on field feedback from similar devices in a similar operating environment.
HFT in IEC 61511 Edition 2
Why not 90%?
So what is confidence level?
The uncertainty in the rate of independent events depends on how many events are measured. Confidence level relates to the width of the uncertainty band, which depends on the number of failures recorded.
We can evaluate the uncertainty with the chi-squared function χ², where:
α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period
Confidence level does not depend directly on population size.
λα = χ²(α, ν) / 2T

λ90 compared with λ70:
λ90 = χ²(0.1, 2n + 2) / 2T
Population              100         100         100         100
Time in service (h)     50,000      50,000      50,000      50,000
Device-hours            5,000,000   5,000,000   5,000,000   5,000,000
Number of failures      0           1           10          100
MTBF (h)                -           5,000,000   500,000     50,000
λAVG (per hour)         -           2.0E-07     2.0E-06     2.0E-05
λ50 (per hour)          1.4E-07     3.4E-07     2.1E-06     2.0E-05
λ70 (per hour)          2.4E-07     4.9E-07     2.5E-06     2.1E-05
λ90 (per hour)          4.6E-07     7.8E-07     3.1E-06     2.3E-05
λ90 / λ50               3.3         2.3         1.4         1.1
λ90 / λ70               1.9         1.6         1.2         1.1
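The estimate λα = χ²(α, ν) / 2T is straightforward to compute. The sketch below is illustrative and not part of the paper: the function and variable names are our own, and the χ² quantile is implemented with a standard-library series expansion and bisection rather than a statistics library. It reproduces the column for 10 failures in 5,000,000 device-hours.

```python
import math

def chi2_cdf(x, df):
    """CDF of the chi-squared distribution, via the regularized lower
    incomplete gamma function computed with a series expansion."""
    if x <= 0:
        return 0.0
    s, z = df / 2.0, x / 2.0
    term = 1.0 / s
    total = term
    n = 0
    while term > total * 1e-12 and n < 10000:
        n += 1
        term *= z / (s + n)
        total += term
    return total * math.exp(-z + s * math.log(z) - math.lgamma(s))

def chi2_quantile(p, df):
    """Invert the CDF by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:   # expand the bracket until it contains p
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def failure_rate(n_failures, device_hours, confidence):
    """lambda at a given confidence level: chi2(CL; 2n + 2) / (2T)."""
    nu = 2 * (n_failures + 1)   # degrees of freedom
    return chi2_quantile(confidence, nu) / (2.0 * device_hours)

# 10 failures in 5,000,000 device-hours (100 devices x 50,000 h)
T = 5_000_000
for cl in (0.5, 0.7, 0.9):
    print(f"lambda_{int(cl * 100)} = {failure_rate(10, T, cl):.2e} per hour")
```

Running this gives λ50 ≈ 2.1E-06, λ70 ≈ 2.5E-06 and λ90 ≈ 3.1E-06 per hour, matching the table.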
Can we be confident?
Users have collected so much data, over many years, in a variety of environments.
With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years.
Surely by now we must have recorded enough failures for most commonly applied devices, so that
λ90 ≈ λ70 ≈ λAVG
Where can we find credible data?
An enormous effort has been made to determine λ. Some widely used sources:
• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data
Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.
Wide variation in reported λ
Why is the variation so wide if λ is constant?
OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence failure rate
λ is NOT constant, so χ² confidence levels do not apply. OREDA provides lower and upper deciles to show uncertainty.
A long-standing debate
Should the calculated probability of failure take into account systematic failures?
The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.
ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.
The intent of the standards
Manage risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods
Reduce risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
‒ Apply diagnostics for fault detection and regular periodic testing
‒ Apply redundant equipment for fault tolerance
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match
If I record 47 heads in 100 tosses, P(head) ≈ 0.47
If a football team wins 47 matches out of 100, what is the probability they will win their next match?
What is random?
Dictionary: 'Made, done, or happening without method or conscious decision; haphazard'
Mathematics: a purely random process involves mutually independent events. The probability of any one event is not dependent on other events.
These definitions are significantly different.
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.
Random:
‒ Hardware, electronic components: constant λ
‒ Hardware, mechanical components: non-constant λ (age and wear related failures in the mid-life period)
‒ Human, operating under stress, non-routine: variable λ
Systematic (cannot be quantified by a fixed rate):
‒ Hardware: specification, design, installation, operation
‒ Software: specification, coding, testing, modification
‒ Human: depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9: catalectic failure: sudden and complete failure
Note 1 to entry: [...] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure: failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
Random hardware failure is not limited to 'catalectic' failure. It cannot be characterised by a fixed constant failure rate, though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially:

PFD(t) = ∫₀ᵗ λDU e^(−λDU τ) dτ = 1 − e^(−λDU t) ≈ λDU t

e.g. at λDU t ≈ 0.1 there is a 10% chance that a device picked at random has failed.
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:
PFD ≈ λDU t
With a proof test interval T, the average is simply:
PFDavg ≈ λDU T / 2
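As a quick numeric check of the linear approximation (an illustrative sketch, not from the paper; λDU = 0.02 per annum is the typical actuated-valve figure quoted later in the slides):

```python
import math

LAMBDA_DU = 0.02   # dangerous undetected failure rate, per annum (typical actuated valve)
T = 1.0            # proof test interval, years

def pfd(t):
    """Exact probability that a device has failed unrevealed by time t."""
    return 1.0 - math.exp(-LAMBDA_DU * t)

# For PFD well below 0.1 the exponential is effectively linear: PFD ~= lambda_DU * t
print(pfd(T), LAMBDA_DU * T)                 # ~0.0198 vs 0.02

# Average over the proof test interval: exact value vs PFDavg ~= lambda_DU * T / 2
pfd_avg_exact = 1.0 - (1.0 - math.exp(-LAMBDA_DU * T)) / (LAMBDA_DU * T)
print(pfd_avg_exact, LAMBDA_DU * T / 2)      # ~0.00993 vs 0.01
```

The exact average and the λDU T / 2 approximation agree to better than 1%, which is far finer than the order-of-magnitude precision available for the inputs.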
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.
With 1oo2 valves the PFD is approximately:
PFDavg ≈ (1 − β)(λDU T)² / 3 + β λDU T / 2
β is the fraction of failures that share common cause.
Common cause failure always strongly dominates PFD. The PFD depends equally on λDU, β and T.
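The 1oo2 approximation PFDavg ≈ (1 − β)(λDU T)² / 3 + β λDU T / 2 can be evaluated with the typical values quoted in these slides (β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y). This short script is illustrative, not part of the paper:

```python
# Typical values from the slides: beta = 10%, lambda_DU = 0.02 per annum, T = 1 year
beta, lam_du, T = 0.10, 0.02, 1.0

independent  = (1 - beta) * (lam_du * T) ** 2 / 3   # both valves fail independently
common_cause = beta * lam_du * T / 2                # one shared cause fails both

pfd_avg = independent + common_cause
print(f"independent:  {independent:.2e}")    # 1.2e-04
print(f"common cause: {common_cause:.2e}")   # 1.0e-03 -- dominates
print(f"PFDavg:       {pfd_avg:.2e}")
```

With these inputs the common cause term is roughly eight times the independent term, and the resulting PFDavg of about 1.1E-03 sits at the bottom of the SIL 2 band.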
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.
The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failure is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).
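The λDU figures above are simply the reciprocals of the quoted MTBF values, assuming a constant average rate. A few illustrative lines (not from the paper; the dictionary is ours) confirm the arithmetic:

```python
# Order-of-magnitude lambda_DU from the typical MTBF figures quoted on the slide
typical_mtbf_years = {
    "sensor": 300,
    "actuated valve assembly": 50,
    "contactor/relay": 200,
}

for device, mtbf in typical_mtbf_years.items():
    lam_du = 1.0 / mtbf   # per annum, assuming a constant average rate
    print(f"{device}: lambda_DU ~= {lam_du:.3f} pa")
```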
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12% - 15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on the application
Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ Low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless the overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design, and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate actual λDU and compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 1
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer
IampE Systems Pty Ltd
Perth Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature, and failure rates being fixed and constant.
Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random Therefore safety function reliability performance can be improved through four key strategies
1 Eliminating systematic and common-cause failures throughout the design development and implementation processes and throughout operation and maintenance practices
2 Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3 Using risk-based inspection and condition-based maintenance techniques to
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4 Detailed analysis of all failures to
- Monitor failure rates
- Determine how failure recurrence may be prevented if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures systematic failures as well as random failures
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'...failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'...failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector, the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2 %, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2 %.
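The relative size of the two terms can be checked numerically. A minimal sketch (the helper function and variable names are ours), using the example values λDU = 0.02 pa and T = 1 year, with β values in the range typically reported:

```python
# Average PFD of a 1oo2 final-element subsystem:
#   PFDavg ≈ (1 - beta) * (lam_du * T)**2 / 3  +  beta * lam_du * T / 2
# The second (common cause) term dominates unless beta << lam_du * T.

def pfd_1oo2(lam_du: float, T: float, beta: float) -> float:
    """lam_du in failures per year, T (proof-test interval) in years."""
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    return independent + common_cause

lam_du, T = 0.02, 1.0
for beta in (0.02, 0.10, 0.15):
    total = pfd_1oo2(lam_du, T, beta)
    ccf = beta * lam_du * T / 2.0
    print(f"beta={beta:.0%}: PFDavg={total:.1e}, CCF share={ccf / total:.0%}")
```

Even at β = 2 % the common cause term already contributes the majority of the PFD; at 10-15 % it contributes almost all of it.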
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10 %. Typical values achieved are in the range of 12 % to 15 %.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and to the mean time to restoration (MTTR).
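The detected-failure contribution can be sketched the same way. This assumes a single channel, with illustrative values of λDD and MTTR (both names are ours; MTTR in hours is converted to years):

```python
# PFD contribution of detected dangerous failures ≈ lam_dd * MTTR.
# Restoring the function quickly (small MTTR) keeps this term negligible.

HOURS_PER_YEAR = 8760.0

def pfd_detected(lam_dd_per_year: float, mttr_hours: float) -> float:
    """Single-channel detected-failure contribution to PFD."""
    return lam_dd_per_year * (mttr_hours / HOURS_PER_YEAR)

# Illustrative example: lam_dd = 0.1 pa, MTTR = 72 hours
print(f"{pfd_detected(0.1, 72.0):.1e}")
```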
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E. P. Box:
'Essentially, all models are wrong, but some are useful.'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant; the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
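The categorisation step can be sketched as a simple lookup from the PFD estimate to the low-demand SIL band of IEC 61508/61511 (the band boundaries are from the standards; the function name is ours):

```python
def sil_band(pfd_avg: float) -> str:
    """Map average PFD to the low-demand SIL band of IEC 61508/61511."""
    if 1e-5 <= pfd_avg < 1e-4:
        return "SIL 4"
    if 1e-4 <= pfd_avg < 1e-3:
        return "SIL 3"
    if 1e-3 <= pfd_avg < 1e-2:
        return "SIL 2"
    if 1e-2 <= pfd_avg < 1e-1:
        return "SIL 1"
    return "outside SIL 1-4 range"

pfd = 3.3e-4                 # order-of-magnitude estimate from the calculation
rrf = 1.0 / pfd              # risk reduction factor
print(sil_band(pfd), f"RRF ≈ {rrf:.0e}")   # one significant figure only
```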
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with a 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by shortening the interval between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
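The effect of the proof-test interval can be sketched with the usual single-channel low-demand approximation PFDavg ≈ λDU·T/2 (a standard formula; the values below are illustrative):

```python
# Single-channel (1oo1) approximation: PFDavg ≈ lam_du * T / 2.
# With lam_du = 0.02 pa, the SIL 2 band (PFDavg < 1e-2) is only reached
# comfortably when the proof-test interval T is shortened.

def pfd_1oo1(lam_du: float, T_years: float) -> float:
    return lam_du * T_years / 2.0

lam_du = 0.02
for T in (2.0, 1.0, 0.5):
    print(f"T={T} y: PFDavg={pfd_1oo1(lam_du, T):.0e}")
```

At T = 1 year the result sits on the SIL 2 boundary, which illustrates why the PFD "may be marginal" and why halving the test interval, or moving to 1oo2, may be needed.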
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased testing frequency alone.
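The benchmark comparison can be sketched as a simple Poisson check: given the λDU assumed in design, how many failures would be expected over the fleet, and is the observed count surprisingly high? This is our own sketch, not the SINTEF procedure; the fleet size, period and 5 % threshold are illustrative.

```python
import math

def poisson_sf(k_observed: int, expected: float) -> float:
    """P(X >= k_observed) for X ~ Poisson(expected)."""
    return 1.0 - sum(math.exp(-expected) * expected**i / math.factorial(i)
                     for i in range(k_observed))

# Design assumption: lam_du = 0.02 pa per valve, 50 valves, 2 years.
expected = 0.02 * 50 * 2          # 2 expected DU failures over the period
observed = 6                      # failures actually recorded
p = poisson_sf(observed, expected)
if p < 0.05:
    print(f"Observed {observed} vs expected {expected:.1f}: "
          f"p={p:.3f} - analyse causes, consider compensating measures")
```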
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
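Even with zero observed failures, an upper confidence bound on a constant failure rate can still be derived from the cumulative service time. For zero failures the standard χ² bound, χ²(confidence, 2)/(2T), reduces to the closed form below; the fleet size and period are illustrative:

```python
import math

def rate_upper_bound_zero_failures(device_hours: float,
                                   confidence: float = 0.90) -> float:
    """Upper confidence bound on a constant failure rate when zero
    failures have been observed in `device_hours` of service.
    Equivalent to the chi-squared bound chi2(confidence, 2) / (2*T)."""
    return -math.log(1.0 - confidence) / device_hours

# Illustrative example: 20 devices, 5 years each, no failures observed.
T = 20 * 5 * 8760.0               # 876,000 device-hours
print(f"lam90 <= {rate_upper_bound_zero_failures(T):.1e} per hour")
```

A small fleet therefore yields only a weak bound, which is why leading indicators from inspection and condition monitoring matter more than measured rates for such equipment.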
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise the initial construction cost. The equipment is not designed to facilitate access for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to the management of the 'random' failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly, with fixed and constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude; the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition distinguishing random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail, and the effectiveness of the processes, techniques and measures, must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F. H. C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Min | Max | 30 | steps | step size | bottom bin | mean | Stddev | ||||||||||||||||||||
Number | inverse | - 0 | 6000 | 6000 | 200 | 200 | $000 | 30 | 12 | ||||||||||||||||||
1 | 10E+00 | assume min-max is 5 std dev | |||||||||||||||||||||||||
2 | 50E-01 | $000 | |||||||||||||||||||||||||
3 | 33E-01 | Bin Bottom | Bin Label | No of Failures | Normalised | ||||||||||||||||||||||
4 | 25E-01 | - 0 | - 0 | 0 | 00014606917 | ||||||||||||||||||||||
5 | 20E-01 | 2 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
6 | 17E-01 | 4 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
7 | 14E-01 | 6 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
8 | 13E-01 | 8 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
9 | 11E-01 | 10 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
10 | 10E-01 | 12 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
11 | 91E-02 | 14 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
12 | 83E-02 | 16 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
13 | 77E-02 | 18 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
14 | 71E-02 | 20 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
15 | 67E-02 | 22 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
16 | 63E-02 | 24 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
17 | 59E-02 | 26 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
18 | 56E-02 | 28 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
19 | 53E-02 | 30 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
20 | 50E-02 | 32 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
21 | 48E-02 | 34 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
22 | 45E-02 | 36 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
23 | 43E-02 | 38 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
24 | 42E-02 | 40 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
25 | 40E-02 | 42 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
26 | 38E-02 | 44 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
27 | 37E-02 | 46 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
28 | 36E-02 | 48 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
29 | 34E-02 | 50 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
30 | 33E-02 | 52 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
31 | 32E-02 | ||||||||||||||||||||||||||
32 | 31E-02 | ||||||||||||||||||||||||||
33 | 30E-02 | ||||||||||||||||||||||||||
34 | 29E-02 | ||||||||||||||||||||||||||
35 | 29E-02 | ||||||||||||||||||||||||||
36 | 28E-02 | ||||||||||||||||||||||||||
37 | 27E-02 | ||||||||||||||||||||||||||
38 | 26E-02 | ||||||||||||||||||||||||||
39 | 26E-02 | ||||||||||||||||||||||||||
40 | 25E-02 | ||||||||||||||||||||||||||
41 | 24E-02 | ||||||||||||||||||||||||||
42 | 24E-02 | ||||||||||||||||||||||||||
43 | 23E-02 | ||||||||||||||||||||||||||
44 | 23E-02 | ||||||||||||||||||||||||||
45 | 22E-02 | ||||||||||||||||||||||||||
46 | 22E-02 | ||||||||||||||||||||||||||
47 | 21E-02 | ||||||||||||||||||||||||||
48 | 21E-02 | ||||||||||||||||||||||||||
49 | 20E-02 | ||||||||||||||||||||||||||
50 | 20E-02 | ||||||||||||||||||||||||||
51 | 20E-02 | ||||||||||||||||||||||||||
52 | 19E-02 | ||||||||||||||||||||||||||
53 | 19E-02 | ||||||||||||||||||||||||||
54 | 19E-02 | ||||||||||||||||||||||||||
55 | 18E-02 | ||||||||||||||||||||||||||
56 | 18E-02 | ||||||||||||||||||||||||||
57 | 18E-02 | ||||||||||||||||||||||||||
58 | 17E-02 | ||||||||||||||||||||||||||
59 | 17E-02 | ||||||||||||||||||||||||||
Average | 30 | hours | 790E-02 | ||||||||||||||||||||||||
33E-02 | |||||||||||||||||||||||||||
Period | - 0 | days | |||||||||||||||||||||||||
- 0 | hours | ||||||||||||||||||||||||||
Population | 1000 | devices | |||||||||||||||||||||||||
Time in service | - 0 | Device-hours | |||||||||||||||||||||||||
Number of failures | - 0 | ||||||||||||||||||||||||||
MTBF | ERRORDIV0 | hours | ERRORDIV0 | years | |||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | |||||||||||||||||||||||||
Confidence level | |||||||||||||||||||||||||||
l90 | ERRORDIV0 | per hour | 90 | ||||||||||||||||||||||||
Check mean failure rate using χsup2 | |||||||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | 50 | ||||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | |||||||||||||||||||||||||||
MTBF | 30 | hours | |||||||||||||||||||||||||
Standard deviation | 17 | hours | |||||||||||||||||||||||||
13s | 22 | hours | |||||||||||||||||||||||||
MTBF-13s | 8 | hours | |||||||||||||||||||||||||
l90 | 13E-01 | per hour | |||||||||||||||||||||||||
Mean | 33E-02 | per hour | |||||||||||||||||||||||||
SD | 58E-02 | per hour | |||||||||||||||||||||||||
13SD | 76E-02 | per hour | |||||||||||||||||||||||||
Mean + 13SD | 11E-01 | per hour | |||||||||||||||||||||||||
Worked example (simulated failure records, 1/1/10 to 9/30/15):
Period: 2098 days (50,352 hours)
Population: 10,000 devices
Time in service: 503,520,000 device-hours
Number of failures: 1000
MTBF = 503,520 hours (57 years)
λ50 = 2.0E-06 per hour; λ90 (90% confidence level) = 2.1E-06 per hour
Worked example (simulated failure records, 1/1/10 to 9/30/15):
Period: 2098 days (50,352 hours)
Population: 1000 devices
Time in service: 50,352,000 device-hours
Number of failures: 20
MTBF = 2,517,600 hours
λ50 = 4.0E-07 per hour; λ90 (90% confidence level) = 5.4E-07 per hour
Check mean failure rate using χ²: λ50 = 4.1E-07 per hour (50% confidence level)
Worked example: 100 devices in service from 1/1/10 to 9/30/15, with failures recorded on 9/30/11, 6/3/13 and 11/4/14:
Period: 2098 days (50,352 hours)
Population: 100 devices
Time in service: 5,035,200 device-hours
Number of failures: 3
MTBF = 1,678,400 hours
λ50 = 6.0E-07 per hour; λ90 (90% confidence level) = 1.3E-06 per hour
Check mean failure rate using χ²: λ50 = 7.3E-07 per hour (50% confidence level)
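The simple point estimates in a worked example like this can be reproduced in a few lines of Python (a sketch; the function name is mine, and λAVG here is the plain average rate, not a χ²-based confidence bound):

```python
def observed_failure_rate(population, period_hours, n_failures):
    """Point estimates of MTBF and average failure rate from field records."""
    device_hours = population * period_hours  # total time in service
    mtbf = device_hours / n_failures          # mean time between failures
    lam_avg = n_failures / device_hours       # failures per device-hour
    return mtbf, lam_avg

# 100 devices over 2098 days (50,352 hours), 3 failures recorded
mtbf, lam_avg = observed_failure_rate(100, 50352, 3)
print(mtbf)     # 1,678,400 hours
print(lam_avg)  # ≈ 6.0E-07 per hour
```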
Hardware fault tolerance
Calculations of failure probability are not enough, because they depend on very coarse assumptions.
So IEC 61511 also specifies minimum requirements for redundancy, i.e. 'hardware fault tolerance':
'to alleviate potential shortcomings in SIF design that may result due to the number of assumptions made in the design of the SIF, along with uncertainty in the failure rate of components or subsystems used in various process applications'
HFT in IEC 61511 Edition 2
New, simpler requirements for HFT:
SIL 3 is allowed with only two block valves (HFT = 1)
• Previously SIL 3 with only two valves needed SFF > 60%, or 'dominant failure to the safe state' (difficult to achieve)
SIL 2 low demand needs only one block valve (HFT = 0)
• Previously two valves were required for SIL 2
Based on the IEC 61508 'Route 2H' method:
• Introduced in 2010
• Requires a confidence level of ≥ 90% in target failure measures (only 70% is needed for Route 1H), e.g. λAVG = number of failures recorded / total time in service
IEC 61511 Ed 2 only requires a 70% confidence level instead of 90%, but similarly to Route 2H it requires credible, traceable failure rate data based on field feedback from similar devices in a similar operating environment.
HFT in IEC 61511 Edition 2
Why not 90%?
So what is confidence level?
The uncertainty in the rate of independent events depends on how many events are measured. Confidence level relates to the width of the uncertainty band, which depends on the number of failures recorded.
We can evaluate the uncertainty with the chi-squared function χ²:
α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period
Confidence level does not depend directly on population size.
λα = χ²(α, ν) / 2T
λ90 compared with λ70:
λ90 = χ²(0.1, 2n + 2) / 2T
Population                 100         100         100         100
Time in service (hours)    50,000      50,000      50,000      50,000
Device-hours               5,000,000   5,000,000   5,000,000   5,000,000
Number of failures         0           1           10          100
MTBF (hours)               n/a         5,000,000   500,000     50,000
λAVG per hour              0           2.0E-07     2.0E-06     2.0E-05
λ50 per hour               1.4E-07     3.4E-07     2.1E-06     2.0E-05
λ70 per hour               2.4E-07     4.9E-07     2.5E-06     2.1E-05
λ90 per hour               4.6E-07     7.8E-07     3.1E-06     2.3E-05
λ90 / λ50                  3.3         2.3         1.4         1.1
λ90 / λ70                  1.9         1.6         1.2         1.1
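The χ² calculation behind this table can be sketched in pure Python. Since ν = 2(n + 1) is always even, the χ² CDF has a closed form (it is an Erlang distribution), so no statistics library is needed; the quantile is found by bisection. The function names are mine:

```python
import math

def chi2_cdf_even_df(x, df):
    # CDF of chi-squared with even df = 2k:
    # P(X <= x) = 1 - exp(-x/2) * sum_{i=0}^{k-1} (x/2)**i / i!
    k = df // 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= (x / 2.0) / i
        total += term
    return 1.0 - math.exp(-x / 2.0) * total

def chi2_quantile_even_df(p, df):
    # Invert the CDF by bisection (df must be even, as 2(n+1) always is)
    lo, hi = 0.0, 1e4
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even_df(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def failure_rate(confidence, n_failures, device_hours):
    # lambda_alpha = chi2(alpha, 2(n+1)) / 2T, with alpha = 1 - confidence,
    # i.e. the chi-squared quantile at probability level `confidence`
    df = 2 * (n_failures + 1)
    return chi2_quantile_even_df(confidence, df) / (2.0 * device_hours)

# Column "10 failures in 5,000,000 device-hours" from the table above:
print(failure_rate(0.50, 10, 5e6))  # ≈ 2.1E-06 per hour
print(failure_rate(0.90, 10, 5e6))  # ≈ 3.1E-06 per hour
```

The same function reproduces the small worked example earlier (3 failures in 5,035,200 device-hours): failure_rate(0.9, 3, 5035200) gives ≈ 1.3E-06 per hour.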
Can we be confident?
Users have collected a great deal of data over many years in a variety of environments. With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years. Surely by now we must have recorded enough failures for most commonly applied devices, so that:
λ90 ≈ λ70 ≈ λAVG
Where can we find credible data?
An enormous effort has been made to determine λ. Some widely used sources:
• OREDA 'Offshore and Onshore Reliability Data Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into the exSILentia software and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data
Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide: uncertainty intervals span one or two orders of magnitude.
Wide variation in reported λ
Why is the variation so wide if λ is constant?
OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence the failure rate
λ is NOT constant, so χ² confidence levels do not apply; OREDA provides lower and upper deciles to show the uncertainty.
A long-standing debate
Should the calculated probability of failure take into account systematic failures?
The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.
ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.
The intent of the standards
Manage the risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods
Reduce the risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
‒ Apply diagnostics for fault detection and regular periodic testing
‒ Apply redundant equipment for fault tolerance
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match
If I record 47 heads in 100 tosses, P(head) ≈ 0.47.
If a football team wins 47 matches out of 100, what is the probability they will win their next match?
What is random?
Dictionary: 'Made, done, or happening without method or conscious decision; haphazard'
Mathematics: A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.
These definitions are significantly different.
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.
Random:
‒ Hardware, electronic components: constant λ
‒ Hardware, mechanical components: non-constant λ (age- and wear-related failures in the mid-life period)
‒ Human, operating under stress, non-routine: variable λ
Systematic (cannot be quantified by a fixed rate):
‒ Hardware: specification, design, installation, operation
‒ Software: specification, coding, testing, modification
‒ Human: depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9, catalectic failure: sudden and complete failure.
Note 1 to entry: [...] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure: failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
Not limited to 'catalectic' failure: such failures cannot be characterised by a fixed constant failure rate, though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitter failures:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age- or wear-related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• so they are treated as quasi-random
Failure rates can be measured:
• they may be reasonably constant for a given operator
• but there is wide variation between different operators
• the rate is not fixed and constant
• it can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed:
PFD(t) = ∫0t λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t) (≈ λDU·t while λDU·t is small)
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially. SIFs always require PFDavg < 0.1, and in this region the accumulation is approximately linear:
PFD ≈ λDU·t
With a proof test interval T the average is simply:
PFDavg ≈ λDU·T / 2
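The exact accumulation and the linear approximation above can be sketched as follows (function names are mine):

```python
import math

def pfd(t, lam_du):
    # Exact: probability that an undetected failure has occurred by time t
    return 1.0 - math.exp(-lam_du * t)

def pfd_avg(lam_du, proof_test_interval):
    # Linear approximation, valid while lam_du * T is small (PFD < 0.1)
    return lam_du * proof_test_interval / 2.0

# Typical valve assembly: lam_DU ~ 0.02 per annum, proof tested yearly
print(pfd(1.0, 0.02))      # ≈ 0.0198, close to the linear value 0.02
print(pfd_avg(0.02, 1.0))  # 0.01
```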
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of the final elements. With 1oo2 valves the PFD is approximately:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates the PFD. The PFD depends equally on λDU, β and T.
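A minimal sketch of the 1oo2 approximation (my function name; β, λDU and T as defined in the text):

```python
def pfd_avg_1oo2(lam_du, T, beta):
    # PFDavg ≈ (1 - beta)(lam_DU * T)^2 / 3 + beta * lam_DU * T / 2
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    return independent + common_cause

# Typical values: lam_DU = 0.02 pa, T = 1 y, beta = 10%
print(pfd_avg_1oo2(0.02, 1.0, 0.10))  # ≈ 1.1E-03
```

With these typical values the common cause term (1.0E-03) is almost ten times the independent term (1.2E-04), which illustrates why common cause dominates the PFD of voted pairs.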
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.
The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain.
Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice.
Typically failure rates vary over at least an order of magnitude.
The scope for reduction in failures is clear: a reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum.
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum.
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum.
These are order-of-magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets
(risk targets can only be set in orders of magnitude).
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12%–15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application
Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design, and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• calculate actual λDU and compare with the design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
‒ preventing systematic failures
• volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
Presented at the 13th International TÜV Rheinland Symposium, May 15–16, 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and on failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant.
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
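The exponential accumulation and its linear approximation can be sketched numerically. This is an illustration, not part of the paper; the rate used is the typical valve figure quoted later in the text:

```python
import math

def pfd_exact(lambda_du: float, t: float) -> float:
    """Exact PFD of a single component: undetected failures accumulate exponentially."""
    return 1.0 - math.exp(-lambda_du * t)

def pfd_linear(lambda_du: float, t: float) -> float:
    """Linear approximation, valid while lambda_du * t << 1 (i.e. PFD < 0.1)."""
    return lambda_du * t

lam = 0.02   # assumed typical undetected dangerous failure rate of a valve, per annum
T = 1.0      # proof-test interval, years

# Average PFD over the proof-test interval, in the linear region: lambda*T/2
pfd_avg = lam * T / 2.0
print(pfd_exact(lam, T), pfd_linear(lam, T), pfd_avg)
```

At these magnitudes the linear approximation overstates the exact PFD only slightly, which is why the simple λT/2 average is adequate for order-of-magnitude work.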
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself, or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
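The dependence of the λ90/λ70 ratio on the number of recorded failures can be sketched as follows. This is an illustration, not from the paper: it assumes the standard χ² upper bound for time-truncated data, λa = χ²a(2n+2)/(2T), and uses the Wilson–Hilferty approximation to the χ² quantile so that no statistics library is needed:

```python
def chi2_quantile(z: float, k: int) -> float:
    """Wilson-Hilferty approximation to the chi-squared quantile with k degrees
    of freedom; z is the standard normal quantile of the confidence level."""
    c = 2.0 / (9.0 * k)
    return k * (1.0 - c + z * c ** 0.5) ** 3

def lambda_upper(n_failures: int, t_hours: float, z: float) -> float:
    """Upper confidence bound on the failure rate from n failures in t hours."""
    return chi2_quantile(z, 2 * n_failures + 2) / (2.0 * t_hours)

Z90, Z70 = 1.2816, 0.5244   # standard normal quantiles for 90% and 70%

T_OBS = 1.0e6  # observed operating time in hours (cancels out of the ratio)
for n in (3, 10, 30):
    ratio = lambda_upper(n, T_OBS, Z90) / lambda_upper(n, T_OBS, Z70)
    print(n, round(ratio, 2))  # the ratio shrinks as more failures are recorded
```

Running this shows the ratio falling from roughly 1.4 at 3 failures towards 1.2 at 10 failures, consistent with the narrowing uncertainty band described above.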
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process: degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are then characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic, and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in the final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1-out-of-2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
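These proportions are easy to check numerically. A short sketch (illustrative only) using the typical values quoted in the text:

```python
def pfd_avg_1oo2(beta: float, lambda_du: float, T: float) -> float:
    """Average PFD of a 1oo2 final-element pair, per the equation above."""
    independent = (1.0 - beta) * (lambda_du * T) ** 2 / 3.0
    common_cause = beta * lambda_du * T / 2.0
    return independent + common_cause

beta, lam, T = 0.10, 0.02, 1.0   # typical values used in the text
indep = (1.0 - beta) * (lam * T) ** 2 / 3.0
ccf = beta * lam * T / 2.0
print(f"independent: {indep:.1e}  common cause: {ccf:.1e}")
# With these values the common cause term is roughly eight times the
# independent term, so the beta factor dominates the PFD of the voted pair.
```

The voted pair therefore achieves a PFD of about 1 × 10⁻³ with these inputs, and almost all of it comes from the β term, not from coincident independent failures.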
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
Wrong but useful
The SINTEF Report A26922 includes a pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful'
The quote appears in the context of a discussion of different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (ie 1oo1 architecture) using either a valve or contactor
It is feasible to achieve the SIL 2 range of risk reduction with a 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
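Why a single valve is marginal for SIL 2 follows directly from PFDavg ≈ λDU·T/2. A minimal sketch, using the typical rates quoted above (illustrative, not from the paper):

```python
def rrf_1oo1(lambda_du: float, T: float) -> float:
    """Risk reduction factor of a single (1oo1) final element: 1 / (lambda*T/2)."""
    return 2.0 / (lambda_du * T)

# Assumed typical valve: lambda_DU = 0.02 pa, proof-tested yearly
print(rrf_1oo1(0.02, 1.0))
# RRF = 100, which sits exactly on the SIL 1 / SIL 2 boundary -- hence
# "marginal". Halving either lambda_DU or T only doubles the RRF to 200,
# still in the lower part of the SIL 2 band (RRF 100 to 1000).
```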
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
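This kind of follow-up can be sketched as a simple benchmark check. The helper below is hypothetical (the names, the 1.5× tolerance margin and the example numbers are all illustrative assumptions, not taken from the SINTEF guideline):

```python
def expected_failures(lambda_design: float, n_devices: int, years: float) -> float:
    """Expected failure count if the design failure-rate benchmark is being met."""
    return lambda_design * n_devices * years

def check_benchmark(observed: int, lambda_design: float,
                    n_devices: int, years: float, margin: float = 1.5) -> str:
    """Compare observed failures against the design benchmark, with a tolerance margin."""
    expected = expected_failures(lambda_design, n_devices, years)
    if observed > margin * expected:
        return "investigate: analyse causes and consider compensating measures"
    return "within benchmark: continue monitoring"

# 40 valves, design lambda_DU = 0.02 pa, 3 years of records, 5 failures recorded
print(check_benchmark(5, 0.02, 40, 3.0))
```

Here the design assumption predicts about 2.4 failures over the period, so 5 recorded failures would trigger the root cause analysis described above.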
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates for some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
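The MTTR contribution can be quantified in the same way as the proof-test term. A minimal sketch (the rates and repair times below are illustrative assumptions, not figures from the paper):

```python
HOURS_PER_YEAR = 8760.0

def pfd_detected(lambda_dd: float, mttr_hours: float) -> float:
    """PFD contribution of detected failures: proportional to out-of-service time."""
    return lambda_dd * (mttr_hours / HOURS_PER_YEAR)

# An assumed detected-failure rate of 0.1 pa with a one-week repair window
print(pfd_detected(0.1, 168.0))
# Cutting MTTR from one week to one day cuts this contribution sevenfold,
# which is why accessibility for repair matters as much as the test interval.
```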
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to the management of the 'random' failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent; this includes all common cause failures.
Conclusions

Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude; the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
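The consequence of that spread can be made concrete. Assuming a hypothetical generic handbook rate and field rates anywhere from 0.1x to 10x that value, the simplified PFDavg (λDU·TI/2) inherits the full two-orders-of-magnitude uncertainty:

```python
# Hedged illustration with hypothetical numbers: an order-of-magnitude
# spread in λDU either side of a generic handbook value propagates
# directly into the PFD estimate, even with an identical test interval.
handbook_lambda_du = 1e-6   # per hour (hypothetical generic value)
test_interval_h = 8760      # annual proof test

pfds = {}
for factor in (0.1, 1.0, 10.0):
    lam = factor * handbook_lambda_du
    pfds[factor] = lam * test_interval_h / 2   # simplified PFDavg
    print(f"λDU = {lam:.1e}/h -> PFDavg ≈ {pfds[factor]:.1e}")
```

This is why a PFD calculation should be read as an order-of-magnitude benchmark rather than a precise prediction.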
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable, provided the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources is limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates assumed in the PFD calculation.
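A typical calculation of this kind can be sketched for a 1oo2 architecture with a β-factor common cause model, simplified from the IEC 61508-6 formulas. The rate and β value below are hypothetical:

```python
# Sketch of the coarse fixed-rate assumptions behind a typical PFD
# calculation: 1oo2 voting with a β-factor common cause model
# (simplified from IEC 61508-6; λDU and β are hypothetical).

def pfd_avg_1oo2(lambda_du: float, beta: float, ti_h: float) -> float:
    """Average PFD for 1oo2: independent double failure + common cause."""
    independent = ((1 - beta) * lambda_du * ti_h) ** 2 / 3
    common_cause = beta * lambda_du * ti_h / 2
    return independent + common_cause

pfd = pfd_avg_1oo2(lambda_du=1e-6, beta=0.1, ti_h=8760)
rrf = 1 / pfd
print(f"PFDavg ≈ {pfd:.1e}, RRF ≈ {rrf:.0f}")
```

With these assumed values the common cause term is roughly twenty times larger than the independent term, which anticipates the point below that common cause failures dominate the PFD of voted architectures.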
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function, and it includes achieving an appropriate level of systematic capability.
The level of attention to detail, and the effectiveness of the processes, techniques and measures, must be in proportion to the target level of risk reduction: SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction; they also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should likewise be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
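Comparing a measured rate against the benchmark needs a confidence bound, because the observed count of failures is itself uncertain. A minimal sketch, using the exact Poisson relation (equivalent to the χ² method, χ²α;2n+2/(2T)) with only the standard library; the population and count are hypothetical:

```python
import math

# Hedged sketch: one-sided upper confidence bound on a failure rate
# from field data (n failures in T device-hours), for comparison
# against the rate assumed in the PFD calculation.

def rate_upper_bound(n_failures: int, device_hours: float,
                     confidence: float = 0.90) -> float:
    """Smallest λ such that P(X <= n | λT) drops to 1 - confidence."""
    target = 1.0 - confidence

    def poisson_cdf(m: float, k: int) -> float:
        return sum(math.exp(-m) * m**i / math.factorial(i)
                   for i in range(k + 1))

    lo, hi = 0.0, (n_failures + 10.0) * 10.0 / max(device_hours, 1.0)
    for _ in range(100):                      # bisection on λ
        mid = (lo + hi) / 2
        if poisson_cdf(mid * device_hours, n_failures) > target:
            lo = mid                          # λ still too small
        else:
            hi = mid
    return hi

# Example: 2 failures observed over 1000 devices in one year (8760 h each)
lam90 = rate_upper_bound(2, 1000 * 8760)
print(f"λ90 ≈ {lam90:.1e} per hour")
```

If the resulting λ90 exceeds the λDU assumed in the PFD calculation, the claimed risk reduction is not being achieved and remedial action is warranted.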
References
[1] 'Functional safety - Safety instrumented systems for the process industry sector - Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing - Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries - Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems - PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems - PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F. H. C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Failure | 5510 | 698150 | 14E-06 | ERRORDIV0 | 70E+05 | ERRORDIV0 | 14E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 5910 | 874164 | 11E-06 | ERRORDIV0 | 87E+05 | ERRORDIV0 | 11E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 51110 | 499082 | 20E-06 | ERRORDIV0 | 50E+05 | ERRORDIV0 | 20E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 51110 | 101503 | 99E-06 | ERRORDIV0 | 10E+05 | ERRORDIV0 | 99E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 51210 | 302989 | 33E-06 | ERRORDIV0 | 30E+05 | ERRORDIV0 | 33E-06 | ERRORDIV0 | ||||||||||||||||||||||
End | 93015 | Average | 536133 | hours | 408E-06 | ERRORDIV0 | 536E+05 | ERRORDIV0 | 408E-06 | ERRORDIV0 | ||||||||||||||||||||
19E-06 | per hour | |||||||||||||||||||||||||||||
Period | 2098 | days | ||||||||||||||||||||||||||||
50352 | hours | |||||||||||||||||||||||||||||
Population | 10000 | devices | ||||||||||||||||||||||||||||
Time in service | 503520000 | Device-hours | ||||||||||||||||||||||||||||
Number of failures | 1000 | 59 | 59 | 59 | 59 | 59 | 59 | |||||||||||||||||||||||
MTBF | 503520 | hours | 57 | years | ||||||||||||||||||||||||||
l50 | 20E-06 | per hour | ||||||||||||||||||||||||||||
Confidence level | ||||||||||||||||||||||||||||||
l90 | 21E-06 | per hour | 90 | 10410605082 | ERRORVALUE | 18 | ERRORDIV0 | ERRORDIV0 | ERRORVALUE | |||||||||||||||||||||
Check mean failure rate using χsup2 | ||||||||||||||||||||||||||||||
l50 | 20E-06 | per hour | 50 | |||||||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | ||||||||||||||||||||||||||||||
MTBF | 536133 | hours | ||||||||||||||||||||||||||||
Standard deviation | 294906 | hours | ||||||||||||||||||||||||||||
13s | 383378 | hours | ||||||||||||||||||||||||||||
MTBF-13s | 152756 | hours | ||||||||||||||||||||||||||||
l90 | 65E-06 | per hour | ||||||||||||||||||||||||||||
Mean | 19E-06 | per hour | ||||||||||||||||||||||||||||
SD | 34E-06 | per hour | ||||||||||||||||||||||||||||
13SD | 44E-06 | per hour | ||||||||||||||||||||||||||||
Mean + 13SD | 63E-06 | per hour | ||||||||||||||||||||||||||||
Population | 100 | 100 | 100 | 100 | 10 | 10 | 10 | 10 | 100 | 100 | 100 | 100 | ||||||||||||||||||
Time in service hours | 50000 | 50000 | 50000 | 50000 | 1000 | 1000 | 1000 | 1000 | 10000 | 10000 | 10000 | 10000 | ||||||||||||||||||
Device-hours | 5000000 | 5000000 | 5000000 | 5000000 | 10000 | 10000 | 10000 | 10000 | 1000000 | 1000000 | 1000000 | 1000000 | ||||||||||||||||||
Number of failures | - 0 | 1 | 10 | 100 | 1 | 10 | 100 | 4 | - 0 | 1 | 10 | 100 | ||||||||||||||||||
MTBF hours | - 0 | 5000000 | 500000 | 50000 | 10000 | 1000 | 100 | 2500 | - 0 | 1000000 | 100000 | 10000 | ||||||||||||||||||
lAVG per hour | - 0 | 20E-07 | 20E-06 | 20E-05 | 10E-04 | 10E-03 | 10E-02 | 40E-04 | - 0 | 10E-06 | 10E-05 | 10E-04 | ||||||||||||||||||
05 | l50 per hour | 14E-07 | 34E-07 | 21E-06 | 20E-05 | 17E-04 | 11E-03 | 10E-02 | 47E-04 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | |||||||||||||||||
07 | l70 per hour | 24E-07 | 49E-07 | 25E-06 | 21E-05 | 24E-04 | 12E-03 | 11E-02 | 59E-04 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | |||||||||||||||||
09 | l90 per hour | 46E-07 | 78E-07 | 31E-06 | 23E-05 | 39E-04 | 15E-03 | 11E-02 | 80E-04 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | |||||||||||||||||
l90 l50 = | 33 | 23 | 14 | 11 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | ||||||||||||||||||||||
l90 l70 = | 19 | 16 | 12 | 11 | 16 | 12 | 11 | 14 | - 0 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ||||||||||||||||||
48784 | 249390 | 2120378 |
Time between failures | Min | Max | 25 | steps | step size | bottom bin | mean | Stddev | ||||||||||||||||||||
Start | 1110 | device-hours | 526 | 665 | 139 | 006 | 006 | $550 | 6 | 0 | ||||||||||||||||||
Failure | 5310 | 2949106 | 34E-07 | 64696904217 | -64696904217 | assume min-max is 5 std dev | ||||||||||||||||||||||
Failure | 82110 | 2632591 | 38E-07 | 64203833357 | -64203833357 | $000 | ||||||||||||||||||||||
Failure | 113010 | 2428252 | 41E-07 | 63852938381 | -63852938381 | Bin Bottom | Bin Label | No of Failures | Normalised | |||||||||||||||||||
Failure | 42711 | 3550948 | 28E-07 | 65503443066 | -65503443066 | 5 | 518 | 0 | 00003321716 | |||||||||||||||||||
Failure | 72111 | 2029839 | 49E-07 | 63074616502 | -63074616502 | 5 | 524 | 1 | 00007853508 | |||||||||||||||||||
Failure | 121411 | 3514339 | 28E-07 | 65458436058 | -65458436058 | 5 | 530 | 0 | 00017721668 | |||||||||||||||||||
Failure | 2312 | 1224713 | 82E-07 | 60880342113 | -60880342113 | 5 | 536 | 0 | 00038166742 | |||||||||||||||||||
Failure | 5412 | 2185019 | 46E-07 | 63394551674 | -63394551674 | 5 | 542 | 0 | 00078452211 | |||||||||||||||||||
Failure | 82012 | 2589443 | 39E-07 | 64132063897 | -64132063897 | 6 | 548 | 0 | 00153909314 | |||||||||||||||||||
Failure | 10812 | 1173440 | 85E-07 | 60694607774 | -60694607774 | 6 | 554 | 0 | 00288180257 | |||||||||||||||||||
Failure | 101612 | 182448 | 55E-06 | 52611385494 | -52611385494 | 6 | 560 | 0 | 00514995168 | |||||||||||||||||||
Failure | 42013 | 4467842 | 22E-07 | 66500977884 | -66500977884 | 6 | 566 | 0 | 00878378494 | |||||||||||||||||||
Failure | 9613 | 3344062 | 30E-07 | 65242743034 | -65242743034 | 6 | 572 | 0 | 01429880827 | |||||||||||||||||||
Failure | 1314 | 2843747 | 35E-07 | 64538909553 | -64538909553 | 6 | 578 | 0 | 02221557716 | |||||||||||||||||||
Failure | 41514 | 2461238 | 41E-07 | 63911536882 | -63911536882 | 6 | 584 | 0 | 03294237965 | |||||||||||||||||||
Failure | 91814 | 3746271 | 27E-07 | 65735991807 | -65735991807 | 6 | 590 | 0 | 04662211211 | |||||||||||||||||||
Failure | 12514 | 1875103 | 53E-07 | 6273025093 | -6273025093 | 6 | 596 | 1 | 06297505128 | |||||||||||||||||||
Failure | 5415 | 3597103 | 28E-07 | 65559528244 | -65559528244 | 6 | 602 | 0 | 08118666926 | |||||||||||||||||||
Failure | 62915 | 1342319 | 74E-07 | 6127855577 | -6127855577 | 6 | 608 | 2 | 0998942587 | |||||||||||||||||||
Failure | 8415 | 858503 | 12E-06 | 59337415921 | -59337415921 | 6 | 614 | 1 | 11731024489 | |||||||||||||||||||
End | 93015 | Average | 2449816 | hours | 723E-07 | avg | 63 | - 63 | 6 | 620 | 0 | 13148341142 | ||||||||||||||||
6 | 626 | 1 | 14065189732 | |||||||||||||||||||||||||
41E-07 | per hour | sd | 031368536 | 031368536 | 6 | 632 | 2 | 14360178406 | ||||||||||||||||||||
avg+13SD | 67 | -67244861308 | 6 | 638 | 2 | 13993091861 | ||||||||||||||||||||||
Period | 2098 | days | 6 | 644 | 3 | 13013890374 | ||||||||||||||||||||||
50352 | hours | MTBF90 | 5302567 | 5302567 | 7 | 650 | 2 | 11551548664 | ||||||||||||||||||||
Lambda90 | 189E-07 | 189E-07 | 7 | 656 | 4 | 09786173006 | ||||||||||||||||||||||
Population | 1000 | devices | 7 | 662 | 0 | 07912708659 | ||||||||||||||||||||||
Time in service | 50352000 | Device-hours | 189E-07 | 00000001886 | 7 | 668 | 1 | 06106285011 | ||||||||||||||||||||
Number of failures | 20 | 7 | 674 | 0 | 04497473114 | |||||||||||||||||||||||
MTBF | 2517600 | hours | ||||||||||||||||||||||||||
l50 | 40E-07 | per hour | ||||||||||||||||||||||||||
-67389046286 | ||||||||||||||||||||||||||||
-6718562883 | ||||||||||||||||||||||||||||
-67704516495 | ||||||||||||||||||||||||||||
-69038810225 | ||||||||||||||||||||||||||||
-68746281123 | ||||||||||||||||||||||||||||
Confidence level | -67719319389 | |||||||||||||||||||||||||||
l90 | 54E-07 | per hour | 90 | -62699281143 | -66371588608 | |||||||||||||||||||||||
Check mean failure rate using χsup2 | -66979903128 | |||||||||||||||||||||||||||
l50 | 41E-07 | per hour | 50 | 005 | ||||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | ||||||||||||||||||||||||||||
MTBF | 2449816 | hours | ||||||||||||||||||||||||||
Standard deviation | 1106903 | hours | ||||||||||||||||||||||||||
13s | 1438973 | hours | ||||||||||||||||||||||||||
MTBF-13s | 1010843 | hours | ||||||||||||||||||||||||||
l90 | 99E-07 | per hour | ||||||||||||||||||||||||||
Mean | 41E-07 | per hour | ||||||||||||||||||||||||||
SD | 90E-07 | per hour | ||||||||||||||||||||||||||
13SD | 12E-06 | per hour | ||||||||||||||||||||||||||
Mean + 13SD | 16E-06 | per hour | ||||||||||||||||||||||||||
Start | 1110 | |||||||
Failure | 93011 | |||||||
Failure | 6313 | |||||||
Failure | 11414 | |||||||
End | 93015 | |||||||
Period | 2098 | days | ||||||
50352 | hours | |||||||
Population | 100 | devices | ||||||
Time in service | 5035200 | Device-hours | ||||||
Number of failures | 3 | |||||||
MTBF | 1678400 | hours | ||||||
l50 | 60E-07 | per hour | ||||||
Confidence level | ||||||||
l90 | 13E-06 | per hour | 90 | |||||
Check mean failure rate using χsup2 | ||||||||
l50 | 73E-07 | per hour | 50 |
HFT in IEC 61511 Edition 2
New, simpler requirement for HFT:

SIL 3 is allowed with only two block valves (HFT = 1)
• Previously, SIL 3 with only two valves needed SFF > 60%
  or 'dominant failure to the safe state' (difficult to achieve)

SIL 2 low demand needs only one block valve (HFT = 0)
• Previously, two valves were required for SIL 2

Based on the IEC 61508 'Route 2H' method:
• Introduced in 2010
• Requires a confidence level of ≥ 90% in target failure measures
  (only 70% needed for Route 1H)
  e.g. λAVG = number of failures recorded ÷ total time in service

IEC 61511 Ed 2 requires only a 70% confidence level instead of 90%, but, similar to Route 2H, it requires credible, traceable failure rate data based on field feedback from similar devices in a similar operating environment.
HFT in IEC 61511 Edition 2
Why not 90%?
So what is confidence level?
The uncertainty in the rate of independent events depends on how many events are measured.
Confidence level relates to the width of the uncertainty band, which depends on the number of failures recorded.
We can evaluate the uncertainty with the chi-squared function, χ²:
α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period
Confidence level does not depend directly on population size
λα = χ²(α, ν) / 2T

λ90 compared with λ70:

λ90 = χ²(0.1, 2n + 2) / 2T
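The λα = χ²(α, ν) / 2T estimate above can be sketched in Python. Because ν = 2n + 2 is always even, the chi-squared CDF has a closed form (a Poisson sum), so no statistics library is needed; the quantile is then found by bisection. This is an illustrative sketch, not a tool from the presentation — the population and failure counts below are the example values from the table that follows.

```python
import math

def chi2_cdf_even(x, dof):
    """Chi-squared CDF for even degrees of freedom, via the closed-form Poisson sum."""
    m = dof // 2
    s = sum((x / 2.0) ** j / math.factorial(j) for j in range(m))
    return 1.0 - math.exp(-x / 2.0) * s

def chi2_quantile_even(p, dof):
    """Invert the CDF by bisection (adequate for this illustration)."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, dof) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf_even(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lambda_upper(n_failures, device_hours, confidence):
    """Failure rate at the given confidence level: chi2(confidence, 2n + 2) / (2T)."""
    return chi2_quantile_even(confidence, 2 * n_failures + 2) / (2.0 * device_hours)

# 100 devices in service for 50,000 h each (T = 5e6 device-hours), 10 failures recorded:
l50 = lambda_upper(10, 5e6, 0.50)  # ~2.1e-06 per hour
l70 = lambda_upper(10, 5e6, 0.70)  # ~2.5e-06 per hour
l90 = lambda_upper(10, 5e6, 0.90)  # ~3.1e-06 per hour
```

The ratio λ90/λ70 depends only on the number of failures recorded, which is the point made in the table below.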
Population                 100         100         100         100
Time in service (hours)    50,000      50,000      50,000      50,000
Device-hours               5,000,000   5,000,000   5,000,000   5,000,000
Number of failures         0           1           10          100
MTBF (hours)               –           5,000,000   500,000     50,000

λAVG per hour              –           2.0E-07     2.0E-06     2.0E-05
λ50 per hour               1.4E-07     3.4E-07     2.1E-06     2.0E-05
λ70 per hour               2.4E-07     4.9E-07     2.5E-06     2.1E-05
λ90 per hour               4.6E-07     7.8E-07     3.1E-06     2.3E-05
λ90 / λ50 =                3.3         2.3         1.4         1.1
λ90 / λ70 =                1.9         1.6         1.2         1.1
Can we be confident?
Users have collected so much data, over many years, in a variety of environments.

With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years.

Surely by now we must have recorded enough failures for most commonly applied devices, so that

λ90 ≈ λ70 ≈ λAVG
Where can we find credible data?
An enormous effort has been made to determine λ. Some widely used sources:
• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data
Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.

Wide variation in reported λ
Why is the variation so wide, if λ is constant?

OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence failure rate

λ is NOT constant, so χ² function confidence levels do not apply. OREDA provides lower and upper deciles to show uncertainty.
A long-standing debate
Should the calculated probability of failure take into account systematic failures?

The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.

ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.
The intent of the standards

Manage risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
  ‒ Apply diagnostics for fault detection and regular periodic testing
  ‒ Apply redundant equipment for fault tolerance
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match

If I record 47 heads in 100 tosses, P(head) ≈ 47%.

If a football team wins 47 matches out of 100, what is the probability they will win their next match?
What is random?
Dictionary:
Made, done, or happening without method or conscious decision; haphazard.

Mathematics:
A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.

These definitions are significantly different.
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.
Guidance from ISO/TR 12489
Random:
Hardware – electronic components: constant λ
Hardware – mechanical components: non-constant λ (age and wear related failures in mid-life period)
Human – operating under stress, non-routine: variable λ
Systematic – cannot be quantified by a fixed rate:
Hardware – specification, design, installation, operation
Software – specification, coding, testing, modification
Human – depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.29: catalectic failure — sudden and complete failure
Note 1 to entry: [...] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure — failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure
Such failures cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but with a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed:

PFD(t) = ∫₀ᵗ λDU e^(−λDU τ) dτ = 1 − e^(−λDU t)
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU t

With a proof test interval T the average is simply:

PFDavg ≈ λDU T / 2
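The linear approximation above can be checked numerically. A minimal sketch, using an illustrative λDU and T (not figures for any specific device):

```python
import math

LAMBDA_DU = 0.02  # dangerous undetected failure rate, per year (illustrative)
T = 1.0           # proof test interval, years

def pfd(t, lam=LAMBDA_DU):
    """Exact probability that an undetected failure has occurred by time t."""
    return 1.0 - math.exp(-lam * t)

# Average PFD over one proof test interval:
# exact:  (1/T) * integral of (1 - e^(-lam*t)) dt  =  1 - (1 - e^(-lam*T)) / (lam*T)
pfd_avg_exact = 1.0 - (1.0 - math.exp(-LAMBDA_DU * T)) / (LAMBDA_DU * T)
# linear approximation, valid while lam*T << 1:
pfd_avg_approx = LAMBDA_DU * T / 2.0
```

With λDU T = 0.02 the two averages agree to better than 1%, which is why PFDavg ≈ λDU T / 2 is adequate whenever PFDavg < 0.1.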
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β) (λDU T)² / 3 + β λDU T / 2

β is the fraction of failures that share common cause.
Common cause failure always strongly dominates PFD. The PFD depends equally on λDU, β and T.
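The dominance of the common cause term is easy to see by evaluating the 1oo2 approximation with the typical values quoted later in this presentation (β ≈ 10%, λDU ≈ 0.02 per annum, T ≈ 1 year):

```python
BETA = 0.10       # common cause fraction (typical value from the text)
LAMBDA_DU = 0.02  # dangerous undetected failure rate, per year
T = 1.0           # proof test interval, years

# PFDavg ≈ (1 - β)(λDU·T)²/3 + β·λDU·T/2
independent_term = (1.0 - BETA) * (LAMBDA_DU * T) ** 2 / 3.0
common_cause_term = BETA * LAMBDA_DU * T / 2.0
pfd_avg_1oo2 = independent_term + common_cause_term
```

Here the common cause term (1.0E-03) is nearly ten times the independent term (1.2E-04), so reducing β is at least as valuable as adding further redundancy.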
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.

The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude, so the scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum

Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum

Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum

These are order of magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12%–15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on the application
Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 p.a., T ≈ 1 y:

SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 p.a. (MTBFDU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 p.a.:
• either λDU, β and/or T must be minimised
• needs close attention in design
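This architecture reasoning can be reproduced with the low-demand SIL bands from IEC 61511-1 Table 4 (PFDavg 10⁻² to 10⁻¹ for SIL 1, 10⁻³ to 10⁻² for SIL 2, and so on). The sil_band helper is an illustrative sketch; the input values are the typical figures quoted above:

```python
def sil_band(pfd_avg):
    """Map a low-demand PFDavg to its SIL band (per IEC 61511-1 Table 4)."""
    if 1e-2 <= pfd_avg < 1e-1:
        return 1
    if 1e-3 <= pfd_avg < 1e-2:
        return 2
    if 1e-4 <= pfd_avg < 1e-3:
        return 3
    if 1e-5 <= pfd_avg < 1e-4:
        return 4
    raise ValueError("PFDavg outside SIL 1-4 bands")

beta, lam_du, T = 0.10, 0.02, 1.0  # typical values from the text

pfd_1oo1 = lam_du * T / 2.0  # single final element
pfd_1oo2 = (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2
```

With these values the single valve lands at PFDavg = 0.01 (only SIL 1), and 1oo2 lands at about 1.1E-03 (SIL 2) — so SIL 3 is only reachable once λDU, β and/or T are reduced, as the slide states.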
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design, and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate actual λDU, compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
  ‒ preventing systematic failures
• Volume of operating experience provides evidence that
  ‒ systematic capability is adequate
  ‒ inspection and maintenance is effective
  ‒ failures are monitored, analysed and understood
  ‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
Presented at the 13th International TÜV Rheinland Symposium - May 15-16, 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications The variation depends largely on the service operating environment and maintenance practices It is clear that
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best the risk reduction achieved by these safety functions can be estimated only within an order of magnitude Nevertheless even with such imprecise results the calculations are still useful
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions This is important in setting operational reliability benchmarks
Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices
Most failures in safety function components (including software) are predictable preventable or avoidable to some degree suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:
   - Identify and then control conditions that induce early failures
   - Actively prevent common-cause failures

4. Detailed analysis of all failures to:
   - Monitor failure rates
   - Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.29 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components then undetected failures will accumulate exponentially The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components
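For small PFDs, the series combination described above reduces to a simple sum of the subsystem PFDavg values. A minimal sketch with illustrative, assumed subsystem figures (not measured data from the paper):

```python
# Illustrative subsystem PFDavg values (assumptions for the sketch)
pfd_sensor = 1.5e-3         # e.g. a sensor with λDU ≈ 0.003/y and T = 1 y
pfd_logic_solver = 1.0e-4
pfd_final_element = 1.0e-2  # e.g. a valve with λDU ≈ 0.02/y and T = 1 y

# Any one dangerous undetected failure defeats the SIF, so for small
# PFDs the overall PFDavg is approximately the sum of the parts.
pfd_sif = pfd_sensor + pfd_logic_solver + pfd_final_element
rrf = 1.0 / pfd_sif  # risk reduction factor achieved
```

With these assumed figures the final element contributes roughly 86% of the total, which is why the presentation concentrates on final-element λDU.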
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90 % ('Route 2H')
A confidence level of 90 % effectively means that there is only a 10 % chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70 %. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, λ90 will be no more than about 20 % higher than λ70.
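This relationship can be sketched as follows. The sketch assumes the usual χ² upper bound λa = χ²a(2n + 2) / (2T) for n failures observed over T cumulative device-hours, and uses the Wilson–Hilferty approximation to the χ² quantile so that only the standard library is needed; the function names are illustrative:

```python
from statistics import NormalDist

def chi2_ppf(p, df):
    # Wilson-Hilferty approximation to the chi-squared quantile function.
    z = NormalDist().inv_cdf(p)
    return df * (1 - 2 / (9 * df) + z * (2 / (9 * df)) ** 0.5) ** 3

def lambda_upper(n_failures, t_hours, conf):
    # Upper confidence bound on the failure rate, given n failures
    # observed over t cumulative device-hours.
    return chi2_ppf(conf, 2 * n_failures + 2) / (2 * t_hours)

# The ratio lambda_90 / lambda_70 depends only on the number of failures:
for n in (3, 10):
    ratio = lambda_upper(n, 1e6, 0.90) / lambda_upper(n, 1e6, 0.70)
    print(n, round(ratio, 2))
```

With 3 failures the ratio comes out at about 1.4; with 10 failures it falls to about 1.24, consistent with the '20 % higher' rule of thumb above. Changing the device-hours changes both bounds but not their ratio.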
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90 % uncertainty interval for the reported failure rates. This is the band stretching from the 5 % certainty level to the 95 % certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1-out-of-2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β) (λDU T)² / 3 + β λDU T / 2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2 %, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2 %.
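Plugging the example figures into the approximation makes the dominance concrete. A minimal sketch (the function name is illustrative; the two terms follow the formula as given above):

```python
def pfd_1oo2(lam_du, t_years, beta):
    # Independent-failure term and common cause term of the 1oo2
    # approximation: PFDavg = (1 - beta)(lam*T)^2 / 3 + beta*lam*T / 2.
    independent = (1 - beta) * (lam_du * t_years) ** 2 / 3
    common_cause = beta * lam_du * t_years / 2
    return independent, common_cause

# lambda_DU = 0.02 pa, proof test interval T = 1 year, beta = 10 %
ind, ccf = pfd_1oo2(0.02, 1.0, 0.10)
print(ind, ccf, ind + ccf)
```

With these figures the common cause term (1.0 × 10⁻³) is nearly ten times the independent term (1.2 × 10⁻⁴): almost all of the calculated PFD of the redundant pair comes from β, not from the redundancy itself.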
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10 %. Typical values achieved are in the range 12 % to 15 %.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful'
The quote appears in the context of a discussion of the different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficient to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
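The one-significant-figure discipline can be sketched in a few lines. The helper names are illustrative, not from the paper; the banding follows the demand-mode PFD ranges of IEC 61508 (e.g. SIL 2 covers average PFD from 10⁻³ up to 10⁻²):

```python
import math

def one_sig_fig(x):
    # Round to one significant figure - the most precision a PFD claim supports.
    exp = math.floor(math.log10(abs(x)))
    return round(x, -exp)

def sil_band(pfd_avg):
    # Demand-mode SIL band for an average PFD (0 means below the SIL 1 range).
    if pfd_avg >= 0.1:
        return 0
    return min(4, int(math.log10(1.0 / pfd_avg)))

pfd = 0.00112   # an illustrative calculated value
print(one_sig_fig(pfd), sil_band(pfd))
```

Reporting this value as 0.001 (SIL 2 feasible) conveys everything the calculation can honestly support; the trailing digits are artefacts of the model, not information.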
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This results in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates for some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate access for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then access for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing lsquorandomrsquo failures
This same approach of active prevention should be extended to the management of random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable, if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
11461280357 | 067 | 063700 | 0 | 01961882481 | |||||||||||||||||||||
11760912591 | 0737 | 070400 | 1 | 02616760765 | |||||||||||||||||||||
12041199827 | 0804 | 077100 | 1 | 03393675985 | |||||||||||||||||||||
12304489214 | 0871 | 083800 | 1 | 04279490399 | |||||||||||||||||||||
12552725051 | 0938 | 090500 | 1 | 0524721746 | |||||||||||||||||||||
1278753601 | 1005 | 097200 | 2 | 0625577896 | |||||||||||||||||||||
13010299957 | 1072 | 103900 | 1 | 07251854011 | |||||||||||||||||||||
13222192947 | 1139 | 110600 | 2 | 08173951106 | |||||||||||||||||||||
13424226808 | 1206 | 117300 | 3 | 08958397809 | |||||||||||||||||||||
1361727836 | 1273 | 124000 | 2 | 09546495624 | |||||||||||||||||||||
13802112417 | 134 | 130700 | 3 | 09891745568 | |||||||||||||||||||||
13979400087 | 1407 | 137400 | 4 | 09965915989 | |||||||||||||||||||||
1414973348 | 1474 | 144100 | 4 | 09762854839 | |||||||||||||||||||||
14313637642 | 1541 | 150800 | 5 | 09299332313 | |||||||||||||||||||||
14471580313 | 1608 | 157500 | 6 | 08612753716 | |||||||||||||||||||||
14623979979 | 1675 | 164200 | 7 | 07756175282 | |||||||||||||||||||||
14771212547 | 1742 | 170900 | 8 | 06791544131 | |||||||||||||||||||||
14913616938 | |||||||||||||||||||||||||
15051499783 | |||||||||||||||||||||||||
15185139399 | |||||||||||||||||||||||||
1531478917 | |||||||||||||||||||||||||
15440680444 | |||||||||||||||||||||||||
15563025008 | |||||||||||||||||||||||||
15682017241 | |||||||||||||||||||||||||
15797835966 | |||||||||||||||||||||||||
1591064607 | |||||||||||||||||||||||||
16020599913 | |||||||||||||||||||||||||
16127838567 | |||||||||||||||||||||||||
16232492904 | |||||||||||||||||||||||||
16334684556 | |||||||||||||||||||||||||
16434526765 | |||||||||||||||||||||||||
16532125138 | |||||||||||||||||||||||||
16627578317 | |||||||||||||||||||||||||
16720978579 | |||||||||||||||||||||||||
16812412374 | |||||||||||||||||||||||||
169019608 | |||||||||||||||||||||||||
16989700043 | |||||||||||||||||||||||||
17075701761 | |||||||||||||||||||||||||
17160033436 | |||||||||||||||||||||||||
17242758696 | |||||||||||||||||||||||||
17323937598 | |||||||||||||||||||||||||
17403626895 | |||||||||||||||||||||||||
1748188027 | |||||||||||||||||||||||||
17558748557 | |||||||||||||||||||||||||
17634279936 | |||||||||||||||||||||||||
17708520116 | |||||||||||||||||||||||||
Average | 135834 | ||||||||||||||||||||||||
Period | - 0 | days | |||||||||||||||||||||||
- 0 | hours | ||||||||||||||||||||||||
Population | 1000 | devices | |||||||||||||||||||||||
Time in service | - 0 | Device-hours | |||||||||||||||||||||||
Number of failures | - 0 | ||||||||||||||||||||||||
MTBF | ERRORDIV0 | hours | ERRORDIV0 | years | |||||||||||||||||||||
l50 | ERRORDIV0 | per hour | |||||||||||||||||||||||
Confidence level | |||||||||||||||||||||||||
l90 | ERRORDIV0 | per hour | 90 | ||||||||||||||||||||||
Check mean failure rate using χsup2 | |||||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | 50 | ||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | |||||||||||||||||||||||||
MTBF | 1 | hours | |||||||||||||||||||||||
Standard deviation | 0 | hours | |||||||||||||||||||||||
13s | 1 | hours | |||||||||||||||||||||||
MTBF-13s | 1 | hours | |||||||||||||||||||||||
l90 | 12E+00 | per hour | |||||||||||||||||||||||
Mean | 74E-01 | per hour | |||||||||||||||||||||||
SD | 26E+00 | per hour | |||||||||||||||||||||||
13SD | 33E+00 | per hour | |||||||||||||||||||||||
Mean + 13SD | 41E+00 | per hour | |||||||||||||||||||||||
Min | Max | 30 | steps | step size | bottom bin | mean | Stddev | ||||||||||||||||||
- 0 | 100 | 100 | 003 | 003 | $000 | 0 | 0 | ||||||||||||||||||
1 | 10E+00 | assume min-max is 5 std dev | |||||||||||||||||||||||
05 | 20E+00 | $000 | |||||||||||||||||||||||
03333333333 | 30E+00 | Bin Bottom | Bin Label | No of Failures | Normalised | ||||||||||||||||||||
025 | 40E+00 | 0 | - 0 | 0 | 18448778924 | ||||||||||||||||||||
02 | 50E+00 | 00335 | 001700 | 30 | 19010216777 | ||||||||||||||||||||
01666666667 | 60E+00 | 0067 | 005000 | 15 | 19737983481 | ||||||||||||||||||||
01428571429 | 70E+00 | 01005 | 008400 | 5 | 19940974276 | ||||||||||||||||||||
0125 | 80E+00 | 0134 | 011700 | 2 | 19590993373 | ||||||||||||||||||||
01111111111 | 90E+00 | 01675 | 015100 | 2 | 1869678694 | ||||||||||||||||||||
01 | 10E+01 | 0201 | 018400 | 1 | 17380866988 | ||||||||||||||||||||
00909090909 | 11E+01 | 02345 | 021800 | 0 | 15669274423 | ||||||||||||||||||||
00833333333 | 12E+01 | 0268 | 025100 | 1 | 13783125675 | ||||||||||||||||||||
00769230769 | 13E+01 | 03015 | 028500 | 0 | 11737945695 | ||||||||||||||||||||
00714285714 | 14E+01 | 0335 | 031800 | 1 | 09769791557 | ||||||||||||||||||||
00666666667 | 15E+01 | 03685 | 035200 | 0 | 07859530562 | ||||||||||||||||||||
00625 | 16E+01 | 0402 | 038500 | 0 | 0618990782 | ||||||||||||||||||||
00588235294 | 17E+01 | 04355 | 041900 | 0 | 04703947001 | ||||||||||||||||||||
00555555556 | 18E+01 | 0469 | 045200 | 0 | 03505454762 | ||||||||||||||||||||
00526315789 | 19E+01 | 05025 | 048600 | 1 | 0251645712 | ||||||||||||||||||||
005 | 20E+01 | 0536 | 051900 | 0 | 0177445853 | ||||||||||||||||||||
00476190476 | 21E+01 | 05695 | 055300 | 0 | 01203311178 | ||||||||||||||||||||
00454545455 | 22E+01 | 0603 | 058600 | 0 | 00802876309 | ||||||||||||||||||||
00434782609 | 23E+01 | 06365 | 062000 | 0 | 00514313198 | ||||||||||||||||||||
00416666667 | 24E+01 | 067 | 065300 | 0 | 00324707809 | ||||||||||||||||||||
004 | 25E+01 | 07035 | 068700 | 0 | 00196489202 | ||||||||||||||||||||
00384615385 | 26E+01 | 0737 | 072000 | 0 | 00117381087 | ||||||||||||||||||||
0037037037 | 27E+01 | 07705 | 075400 | 0 | 00067098222 | ||||||||||||||||||||
00357142857 | 28E+01 | 0804 | 078700 | 0 | 00037928426 | ||||||||||||||||||||
00344827586 | 29E+01 | 08375 | 082100 | 0 | 00020480692 | ||||||||||||||||||||
00333333333 | 30E+01 | 0871 | 085400 | 0 | 00010954506 | ||||||||||||||||||||
00322580645 | 31E+01 | ||||||||||||||||||||||||
003125 | 32E+01 | ||||||||||||||||||||||||
00303030303 | 33E+01 | ||||||||||||||||||||||||
00294117647 | 34E+01 | ||||||||||||||||||||||||
00285714286 | 35E+01 | ||||||||||||||||||||||||
00277777778 | 36E+01 | ||||||||||||||||||||||||
0027027027 | 37E+01 | ||||||||||||||||||||||||
00263157895 | 38E+01 | ||||||||||||||||||||||||
00256410256 | 39E+01 | ||||||||||||||||||||||||
0025 | 40E+01 | ||||||||||||||||||||||||
00243902439 | 41E+01 | ||||||||||||||||||||||||
00238095238 | 42E+01 | ||||||||||||||||||||||||
0023255814 | 43E+01 | ||||||||||||||||||||||||
00227272727 | 44E+01 | ||||||||||||||||||||||||
00222222222 | 45E+01 | ||||||||||||||||||||||||
00217391304 | 46E+01 | ||||||||||||||||||||||||
00212765957 | 47E+01 | ||||||||||||||||||||||||
00208333333 | 48E+01 | ||||||||||||||||||||||||
00204081633 | 49E+01 | ||||||||||||||||||||||||
002 | 50E+01 | ||||||||||||||||||||||||
00196078431 | 51E+01 | ||||||||||||||||||||||||
00192307692 | 52E+01 | ||||||||||||||||||||||||
00188679245 | 53E+01 | ||||||||||||||||||||||||
00185185185 | 54E+01 | ||||||||||||||||||||||||
00181818182 | 55E+01 | ||||||||||||||||||||||||
00178571429 | 56E+01 | ||||||||||||||||||||||||
00175438596 | 57E+01 | ||||||||||||||||||||||||
00172413793 | 58E+01 | ||||||||||||||||||||||||
00169491525 | 59E+01 | ||||||||||||||||||||||||
Average | 007904 | hours | 300E+01 | ||||||||||||||||||||||
33E-02 | |||||||||||||||||||||||||
Period | - 0 | days | |||||||||||||||||||||||
- 0 | hours | ||||||||||||||||||||||||
Population | 1000 | devices | |||||||||||||||||||||||
Time in service | - 0 | Device-hours | |||||||||||||||||||||||
Number of failures | - 0 | ||||||||||||||||||||||||
MTBF | ERRORDIV0 | hours | ERRORDIV0 | years | |||||||||||||||||||||
l50 | ERRORDIV0 | per hour | |||||||||||||||||||||||
Confidence level | |||||||||||||||||||||||||
l90 | ERRORDIV0 | per hour | 90 | ||||||||||||||||||||||
Check mean failure rate using χsup2 | |||||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | 50 | ||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | |||||||||||||||||||||||||
MTBF | 0 | hours | |||||||||||||||||||||||
Standard deviation | 0 | hours | |||||||||||||||||||||||
13s | 0 | hours | |||||||||||||||||||||||
MTBF-13s | - 0 | hours | |||||||||||||||||||||||
l90 | -89E+00 | per hour | |||||||||||||||||||||||
Mean | 13E+01 | per hour | |||||||||||||||||||||||
SD | 68E+00 | per hour | |||||||||||||||||||||||
13SD | 88E+00 | per hour | |||||||||||||||||||||||
Mean + 13SD | 21E+01 | per hour | |||||||||||||||||||||||
Min | Max | 30 | steps | step size | bottom bin | mean | Stddev | ||||||||||||||||||||
Number | inverse | - 0 | 6000 | 6000 | 200 | 200 | $000 | 30 | 12 | ||||||||||||||||||
1 | 10E+00 | assume min-max is 5 std dev | |||||||||||||||||||||||||
2 | 50E-01 | $000 | |||||||||||||||||||||||||
3 | 33E-01 | Bin Bottom | Bin Label | No of Failures | Normalised | ||||||||||||||||||||||
4 | 25E-01 | - 0 | - 0 | 0 | 00014606917 | ||||||||||||||||||||||
5 | 20E-01 | 2 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
6 | 17E-01 | 4 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
7 | 14E-01 | 6 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
8 | 13E-01 | 8 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
9 | 11E-01 | 10 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
10 | 10E-01 | 12 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
11 | 91E-02 | 14 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
12 | 83E-02 | 16 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
13 | 77E-02 | 18 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
14 | 71E-02 | 20 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
15 | 67E-02 | 22 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
16 | 63E-02 | 24 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
17 | 59E-02 | 26 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
18 | 56E-02 | 28 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
19 | 53E-02 | 30 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
20 | 50E-02 | 32 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
21 | 48E-02 | 34 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
22 | 45E-02 | 36 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
23 | 43E-02 | 38 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
24 | 42E-02 | 40 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
25 | 40E-02 | 42 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
26 | 38E-02 | 44 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
27 | 37E-02 | 46 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
28 | 36E-02 | 48 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
29 | 34E-02 | 50 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
30 | 33E-02 | 52 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
31 | 32E-02 | ||||||||||||||||||||||||||
32 | 31E-02 | ||||||||||||||||||||||||||
33 | 30E-02 | ||||||||||||||||||||||||||
34 | 29E-02 | ||||||||||||||||||||||||||
35 | 29E-02 | ||||||||||||||||||||||||||
36 | 28E-02 | ||||||||||||||||||||||||||
37 | 27E-02 | ||||||||||||||||||||||||||
38 | 26E-02 | ||||||||||||||||||||||||||
39 | 26E-02 | ||||||||||||||||||||||||||
40 | 25E-02 | ||||||||||||||||||||||||||
41 | 24E-02 | ||||||||||||||||||||||||||
42 | 24E-02 | ||||||||||||||||||||||||||
43 | 23E-02 | ||||||||||||||||||||||||||
44 | 23E-02 | ||||||||||||||||||||||||||
45 | 22E-02 | ||||||||||||||||||||||||||
46 | 22E-02 | ||||||||||||||||||||||||||
47 | 21E-02 | ||||||||||||||||||||||||||
48 | 21E-02 | ||||||||||||||||||||||||||
49 | 20E-02 | ||||||||||||||||||||||||||
50 | 20E-02 | ||||||||||||||||||||||||||
51 | 20E-02 | ||||||||||||||||||||||||||
52 | 19E-02 | ||||||||||||||||||||||||||
53 | 19E-02 | ||||||||||||||||||||||||||
54 | 19E-02 | ||||||||||||||||||||||||||
55 | 18E-02 | ||||||||||||||||||||||||||
56 | 18E-02 | ||||||||||||||||||||||||||
57 | 18E-02 | ||||||||||||||||||||||||||
58 | 17E-02 | ||||||||||||||||||||||||||
59 | 17E-02 | ||||||||||||||||||||||||||
Average | 30 | hours | 790E-02 | ||||||||||||||||||||||||
33E-02 | |||||||||||||||||||||||||||
Period | - 0 | days | |||||||||||||||||||||||||
- 0 | hours | ||||||||||||||||||||||||||
Population | 1000 | devices | |||||||||||||||||||||||||
Time in service | - 0 | Device-hours | |||||||||||||||||||||||||
Number of failures | - 0 | ||||||||||||||||||||||||||
MTBF | ERRORDIV0 | hours | ERRORDIV0 | years | |||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | |||||||||||||||||||||||||
Confidence level | |||||||||||||||||||||||||||
l90 | ERRORDIV0 | per hour | 90 | ||||||||||||||||||||||||
Check mean failure rate using χsup2 | |||||||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | 50 | ||||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | |||||||||||||||||||||||||||
MTBF | 30 | hours | |||||||||||||||||||||||||
Standard deviation | 17 | hours | |||||||||||||||||||||||||
13s | 22 | hours | |||||||||||||||||||||||||
MTBF-13s | 8 | hours | |||||||||||||||||||||||||
l90 | 13E-01 | per hour | |||||||||||||||||||||||||
Mean | 33E-02 | per hour | |||||||||||||||||||||||||
SD | 58E-02 | per hour | |||||||||||||||||||||||||
13SD | 76E-02 | per hour | |||||||||||||||||||||||||
Mean + 13SD | 11E-01 | per hour | |||||||||||||||||||||||||
Worked example: time between failures
Population: 10,000 devices, in service 1/1/10 to 9/30/15
Period: 2,098 days = 50,352 hours
Time in service: 503,520,000 device-hours
Number of failures: 1,000
MTBF = 503,520 hours (≈ 57 years)
λ50 ≈ 2.0E-06 per hour; λ90 ≈ 2.1E-06 per hour (90% confidence)
Check of mean failure rate using χ²: λ50 ≈ 2.0E-06 per hour
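The arithmetic in this worked example can be sketched in a few lines; the population, period and failure count below are the example's values, not general data:

```python
# Sketch of the MTBF / average failure rate arithmetic for the
# worked example (1 year = 8,760 hours).
population = 10_000                 # devices in service
period_hours = 2_098 * 24           # 1/1/10 to 9/30/15 = 2,098 days
device_hours = population * period_hours
failures = 1_000

mtbf = device_hours / failures      # mean time between failures, hours
lam_avg = failures / device_hours   # average failure rate, per hour

print(f"Time in service = {device_hours:,} device-hours")
print(f"MTBF = {mtbf:,.0f} hours ({mtbf / 8_760:.0f} years)")
print(f"lambda_avg = {lam_avg:.1e} per hour")
```

This reproduces the 503,520,000 device-hours, the MTBF of 503,520 hours (about 57 years) and the average rate of about 2.0E-06 per hour quoted in the example.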
Worked example: time between failures
Population: 1,000 devices, in service 1/1/10 to 9/30/15
Period: 2,098 days = 50,352 hours
Time in service: 50,352,000 device-hours
Number of failures: 20
MTBF = 2,517,600 hours
λ50 ≈ 4.0E-07 per hour; λ90 ≈ 5.4E-07 per hour (90% confidence)
Check of mean failure rate using χ²: λ50 ≈ 4.1E-07 per hour
Worked example: a small population
Start 1/1/10; failures on 9/30/11, 6/3/13 and 11/4/14; end 9/30/15
Period: 2,098 days = 50,352 hours
Population: 100 devices; Time in service: 5,035,200 device-hours
Number of failures: 3
MTBF = 1,678,400 hours
λ50 ≈ 6.0E-07 per hour; λ90 ≈ 1.3E-06 per hour (90% confidence)
Check of mean failure rate using χ²: λ50 ≈ 7.3E-07 per hour
Based on the IEC 61508 'Route 2H' method
• Introduced in 2010
• Requires a confidence level of ≥ 90% in target failure measures (only 70% needed for Route 1H)
e.g. λAVG = number of failures recorded ÷ total time in service

IEC 61511 Ed 2 only requires a 70% confidence level instead of 90%, but similar to Route 2H it requires credible, traceable failure rate data based on field feedback from similar devices in a similar operating environment.
HFT in IEC 61511 Edition 2
Why not 90%?
So what is confidence level?
The uncertainty in the rate of independent events depends on how many events are measured.
Confidence level relates to the width of the uncertainty band, which depends on the number of failures recorded.

We can evaluate the uncertainty with the chi-squared function χ²:

α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period

Confidence level does not depend directly on population size.
λα = χ²(α, ν) / 2T

λ90 compared with λ70:

λ90 = χ²(0.1, 2n + 2) / 2T
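This bound can be sketched in code. A minimal stdlib-only sketch: the chi-squared quantile is approximated with the Wilson-Hilferty formula, so it agrees with an exact inverse (Excel's CHIINV, or scipy.stats.chi2) only to about a percent, which is ample for order-of-magnitude failure-rate work:

```python
from statistics import NormalDist

def chi2_quantile(p: float, dof: float) -> float:
    """Approximate p-quantile of the chi-squared distribution
    (Wilson-Hilferty cube approximation)."""
    z = NormalDist().inv_cdf(p)       # standard normal quantile
    h = 2.0 / (9.0 * dof)
    return dof * (1.0 - h + z * h ** 0.5) ** 3

def lambda_bound(confidence: float, n: int, T: float) -> float:
    """Failure rate not exceeded with the given confidence after n
    failures in T device-hours: chi2(alpha, 2n + 2) / 2T, where
    alpha = 1 - confidence is CHIINV's right-tail probability, i.e.
    the `confidence` quantile used here."""
    dof = 2 * (n + 1)                 # degrees of freedom, 2n + 2
    return chi2_quantile(confidence, dof) / (2.0 * T)

# 10 failures over 5,000,000 device-hours:
print(f"lambda_50 = {lambda_bound(0.50, 10, 5e6):.1e} per hour")
print(f"lambda_90 = {lambda_bound(0.90, 10, 5e6):.1e} per hour")
```

For 10 failures in 5,000,000 device-hours this gives λ50 ≈ 2.1E-06 and λ90 ≈ 3.1E-06 per hour, matching the 10-failure column of the table that follows.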
| Population | 100 | 100 | 100 | 100 |
| Time in service (hours) | 50,000 | 50,000 | 50,000 | 50,000 |
| Device-hours | 5,000,000 | 5,000,000 | 5,000,000 | 5,000,000 |
| Number of failures | 0 | 1 | 10 | 100 |
| MTBF (hours) | – | 5,000,000 | 500,000 | 50,000 |
| λAVG (per hour) | – | 2.0E-07 | 2.0E-06 | 2.0E-05 |
| λ50 (per hour) | 1.4E-07 | 3.4E-07 | 2.1E-06 | 2.0E-05 |
| λ70 (per hour) | 2.4E-07 | 4.9E-07 | 2.5E-06 | 2.1E-05 |
| λ90 (per hour) | 4.6E-07 | 7.8E-07 | 3.1E-06 | 2.3E-05 |
| λ90 / λ50 | 3.3 | 2.3 | 1.4 | 1.1 |
| λ90 / λ70 | 1.9 | 1.6 | 1.2 | 1.1 |
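The estimates in this table can be reproduced with a short script. The sketch below is illustrative, not from the paper: it uses only Python's standard library, approximating the χ² quantile with the Wilson-Hilferty formula (accurate to about 1% here; scipy.stats.chi2.ppf would give the exact values), and the function names are my own.

```python
from statistics import NormalDist

def chi2_quantile(p, df):
    # Wilson-Hilferty approximation to the chi-squared quantile function
    z = NormalDist().inv_cdf(p)        # standard normal quantile
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3

def failure_rate(n_failures, device_hours, confidence):
    # Upper one-sided estimate: lambda_a = chi2(a, 2n + 2) / (2T)
    df = 2 * n_failures + 2            # degrees of freedom, nu = 2(n + 1)
    return chi2_quantile(confidence, df) / (2.0 * device_hours)

# Reproduce the '1 failure in 5,000,000 device-hours' column:
print(failure_rate(1, 5_000_000, 0.90))   # ~7.7e-07 (table: 7.8E-07 with exact chi-squared)
print(failure_rate(1, 5_000_000, 0.70))   # ~4.9e-07
```

The ratio of the two estimates shrinks as more failures are recorded, which is the point of the table.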
Can we be confident?
Users have collected so much data over many years in a variety of environments. With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years. Surely by now we must have recorded enough failures for most commonly applied devices, so that:

λ90 ≈ λ70 ≈ λAVG
Where can we find credible data?
An enormous effort has been made to determine λ. Some widely used sources:
• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data
Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.

Wide variation in reported λ
Why is the variation so wide if λ is constant? OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence failure rate
λ is NOT constant, so χ² confidence levels do not apply. OREDA provides lower and upper deciles to show the uncertainty.
A long-standing debate
Should the calculated probability of failure take into account systematic failures?
The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures. ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.
The intent of the standards
Manage the risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce the risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
‒ Apply diagnostics for fault detection and regular periodic testing
‒ Apply redundant equipment for fault tolerance
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match
If I record 47 heads in 100 tosses, P(head) ≈ 0.47.
If a football team wins 47 matches out of 100, what is the probability they will win their next match?
What is random?
Dictionary: 'Made, done, or happening without method or conscious decision; haphazard.'
Mathematics: A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.
These definitions are significantly different
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.
Guidance from ISO/TR 12489
Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age- and wear-related failures in mid-life period)
• Human – operating under stress, non-routine: variable λ
Systematic – cannot be quantified by a fixed rate:
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9: catalectic failure – sudden and complete failure.
Note 1 to entry: [...] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure –
'failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
Not limited to 'catalectic' failure
Random hardware failure cannot be characterised by a fixed constant failure rate, though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age- or wear-related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed:

PFD(t) = ∫0t λDU e^(−λDU τ) dτ = 1 − e^(−λDU t) ≈ λDU t
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU t

With a proof test interval T the average is simply:

PFDavg ≈ λDU T / 2
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.
With 1oo2 valves the PFD is approximately:

PFDAVG ≈ (1 − β)(λDU T)² / 3 + β λDU T / 2

β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates PFD. The PFD depends equally on λDU, β and T.
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude, so the scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12% to 15% in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on the application
Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
• SIL 1 is always easy to achieve without HFT.
• SIL 2 needs 1oo2 final elements, unless the overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
• SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
‒ either λDU, β and/or T must be minimised
‒ needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance. Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate the actual λDU and compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant That assumption is invalid but the calculations are still useful
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates It draws conclusions on how failure rates and failure probability can be controlled in practice
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications The variation depends largely on the service operating environment and maintenance practices It is clear that
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best the risk reduction achieved by these safety functions can be estimated only within an order of magnitude Nevertheless even with such imprecise results the calculations are still useful
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions This is important in setting operational reliability benchmarks
Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices
Most failures in safety function components (including software) are predictable preventable or avoidable to some degree suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1 Eliminating systematic and common-cause failures throughout the design development and implementation processes and throughout operation and maintenance practices
2 Designing the equipment to allow access to enable sufficiently frequent inspection testing and maintenance and to enable suitable test coverage
3 Using risk-based inspection and condition-based maintenance techniques to
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4 Detailed analysis of all failures to
- Monitor failure rates
- Determine how failure recurrence may be prevented if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction Functional safety maintains safety integrity of assets in two ways
Systematic safety integrity deals with preventable failures These are failures resulting from errors and shortcomings in the design manufacture installation operation maintenance and modification of the safeguarding systems
Hardware safety integrity deals with controlling random hardware failures These are the failures that occur at a reasonably constant rate and are completely independent of each other They are not preventable and cannot be avoided or eliminated but the probability of these failures occurring (and the resulting risk reduction) can be calculated
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the probability of failure on demand (PFD) for safety functions and the corresponding risk reduction factor (RRF) achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components then undetected failures will accumulate exponentially The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components
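The accumulation and averaging described here can be sketched numerically. This is an illustration under the constant-rate assumption stated above; the function name pfd_avg is my own.

```python
import math

def pfd_avg(lam_du, proof_test_interval):
    # Average of PFD(t) = 1 - exp(-lambda*t) over one proof-test interval [0, T]:
    #   exact:  1 - (1 - exp(-lambda*T)) / (lambda*T)
    #   linear approximation lambda*T/2, valid while lambda*T << 1
    lt = lam_du * proof_test_interval
    exact = 1.0 - (1.0 - math.exp(-lt)) / lt
    approx = lt / 2.0
    return exact, approx

# Actuated valve example: lambda_DU = 0.02 per annum, proof tested yearly
exact, approx = pfd_avg(0.02, 1.0)   # both close to 0.01
```

For λT this small the two results differ by well under 1%, which is why the linear approximation is used throughout the paper.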
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability For this reason the standards specify a minimum level of fault tolerance (ie redundancy) in the architectural design of the safety functions The required level of redundancy increases with the risk reduction required
Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions Redundancy also increases the likelihood of inadvertent or spurious action which in itself may lead to increased risk of hazards
IEC 61508 provides two strategies for minimising the required hardware fault tolerance
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded; it does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment In general the data exclude infant mortality failures and end-of-life failures
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software The exida database is also published in the exida Safety Equipment Reliability Handbook[12] The failure rates in these references are reasonably consistent with the OREDA data though they do not give any indication of the wide uncertainty band
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications Some users consistently achieve failure rates at least 10 times lower than other users The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design operation and maintenance
One reason for the variability in rates is that these datasets include all failures systematic failures as well as random failures
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random Very few failures are purely random
Failures of mechanical components may seem to be random but the causes are usually well known and understood and are partially deterministic The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed
Degradation of mechanical components is not a purely random process Degradation can be monitored
While it might be theoretically possible to prevent failure from degradation it is simply not practicable to prevent all failures Inspection and maintenance can never be perfect It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance overhaul and renewal They are characterised by a constant failure rate even though that rate is not fixed
The reasons for the wide variability in reported failure rates seem to be clear
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events These failures are largely systematic and preventable
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors With fault tolerance in final elements the PFD is strongly dominated by common cause failures
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDAVG ≈ (1 − β)(λDU T)² / 3 + β λDU T / 2
The last term in this equation represents the contribution of common cause failures It will dominate the PFD unless the following is true
β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible the test interval T would usually have to be significantly longer than 1 year andor the β-factor would have to be much less than 2
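The dominance of the common cause term is easy to verify numerically. The sketch below is my own illustration of the approximation quoted above, with hypothetical function names:

```python
def pfd_1oo2(lam_du, beta, test_interval):
    # PFDavg ~ (1 - beta)(lambda_DU * T)^2 / 3  +  beta * lambda_DU * T / 2
    lt = lam_du * test_interval
    independent = (1.0 - beta) * lt ** 2 / 3.0   # both valves fail independently
    common_cause = beta * lt / 2.0               # both valves fail from a common cause
    return independent, common_cause

# Typical values from the text: lambda_DU = 0.02 pa, T = 1 y, beta = 10%
ind, ccf = pfd_1oo2(0.02, 0.10, 1.0)
# ccf (1.0e-3) is roughly eight times ind (1.2e-4): common cause dominates
```

Halving β or λDU reduces the dominant term in direct proportion, which is why the paper treats β, λDU and T as equally important design parameters.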
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β Similarly we can conclude that although the models used for calculating PFD are wrong they are still useful
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant and the band of uncertainty spans more than one order of magnitude The statistics suggest that the failure rates of sensors are also not constant
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision because the failure rates are not constant It is misleading to report calculated PFD with more than one significant figure of precision
Application of Markov models Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision The results are much more precise but no more accurate
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 p.a. quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 p.a.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
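The sensitivity to the proof-test interval can be illustrated with the simplified 1oo1 approximation PFD ≈ λDU·T/2. The parameter values below are assumptions for illustration, not from the paper:

```python
# Effect of the proof-test interval T on the simplified 1oo1 PFD estimate
# PFD ~ lambda_DU * T / 2; the rate below is an illustrative assumption.
lambda_du = 0.02  # dangerous undetected failures per year

for t_years in (2.0, 1.0, 0.5):
    pfd = lambda_du * t_years / 2.0
    print(f"T = {t_years} y -> PFD ~ {pfd:.3f} (RRF ~ {1 / pfd:.0f})")
```

With these assumed numbers, halving the proof-test interval from one year to six months moves the function from the SIL 1/SIL 2 boundary (RRF 100) comfortably into the SIL 2 band (RRF 200).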
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
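A hedged sketch of why β matters so much in redundant architectures, using the simplified 1oo2 low-demand approximation (an independent-failure term plus a common cause term); all parameter values are illustrative assumptions:

```python
# Illustrative comparison of 1oo1 vs 1oo2 final-element PFD using the
# simplified low-demand formulas (after IEC 61508-6); values are assumed.

def pfd_1oo1(lam_du, t):
    return lam_du * t / 2.0

def pfd_1oo2(lam_du, t, beta):
    # Independent-failure term plus the common cause term; the common
    # cause term usually dominates, hence the emphasis on minimising beta.
    independent = (lam_du * t) ** 2 / 3.0
    common_cause = beta * lam_du * t / 2.0
    return independent + common_cause

lam_du, t, beta = 0.02, 1.0, 0.1   # per year, years, 10 % common cause
print(f"1oo1: {pfd_1oo1(lam_du, t):.1e}")        # 1.0e-02
print(f"1oo2: {pfd_1oo2(lam_du, t, beta):.1e}")  # 1.1e-03
```

With these assumed values the common cause term (1.0e-03) is nearly an order of magnitude larger than the independent term (1.3e-04), so redundancy alone buys only about a factor of β in risk reduction.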
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
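This kind of follow-up check can be sketched as a comparison of observed failures against the count expected from the design failure rate, using a Poisson model. The population, period and rate below are illustrative assumptions, and `poisson_exceedance` is a hypothetical helper:

```python
# Sketch of a follow-up check: compare failures observed in operation
# against the number expected from the design failure rate assumption.
from math import exp

def expected_failures(lambda_du_per_year, population, years):
    return lambda_du_per_year * population * years

def poisson_exceedance(observed, expected):
    """P(X >= observed) for a Poisson count with the given expectation;
    a small value flags rates significantly above the design assumption."""
    p, term = 0.0, exp(-expected)
    for i in range(observed):     # accumulate P(X <= observed - 1)
        p += term
        term *= expected / (i + 1)
    return 1.0 - p

# Assumed: 50 devices, 2 years, target lambda_DU = 0.02 p.a. -> 2 expected.
target = expected_failures(0.02, population=50, years=2)
print(f"expected {target:.0f}, P(>=6 observed) = {poisson_exceedance(6, target):.3f}")
```

With these assumed numbers, observing 6 failures when only 2 are expected has a probability of roughly 0.017 under the design assumption, which would trigger analysis of the causes.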
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
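Even with no observed failures, an upper confidence bound on the rate can still be derived from the zero-failure probability of a constant-rate (exponential) model: with zero failures in T device-hours, the one-sided bound at confidence C solves exp(-λT) = 1 - C. The population and period below are illustrative assumptions:

```python
# Upper confidence bound on the failure rate when zero failures have been
# observed; population and observation period are illustrative assumptions.
from math import log

def upper_bound_zero_failures(device_hours, confidence=0.90):
    # Solve exp(-lambda * T) = 1 - C for lambda.
    return -log(1.0 - confidence) / device_hours

# Assumed: 100 devices observed for 3 years (8,760 h/year), no failures.
t_total = 100 * 3 * 8760
lam90 = upper_bound_zero_failures(t_total)
print(f"lambda_90 < {lam90:.1e} per hour")  # ~8.8e-07 per hour
```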
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.
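The unavailability contributed by out-of-service time alone can be sketched as a simple time fraction; the 48-hour figure below is an illustrative assumption:

```python
# Average unavailability contributed simply by having the safety function
# bypassed or in repair: the fraction of operating time out of service.
def bypass_unavailability(hours_out_of_service, hours_in_period):
    return hours_out_of_service / hours_in_period

# Assumed: 48 h bypassed per year for testing and repair.
print(f"{bypass_unavailability(48, 8760):.1e}")  # ~5.5e-03
```

Under that assumption, bypass time alone contributes about 5.5e-03 to the PFD, enough on its own to undermine a SIL 2 target.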
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly, with fixed and constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude; the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail, and the effectiveness of the processes, techniques and measures, must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F. H. C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
l50 | ERRORDIV0 | per hour | |||||||||||||||||||||||||
Confidence level | |||||||||||||||||||||||||||
l90 | ERRORDIV0 | per hour | 90 | ||||||||||||||||||||||||
Check mean failure rate using χsup2 | |||||||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | 50 | ||||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | |||||||||||||||||||||||||||
MTBF | 30 | hours | |||||||||||||||||||||||||
Standard deviation | 17 | hours | |||||||||||||||||||||||||
13s | 22 | hours | |||||||||||||||||||||||||
MTBF-13s | 8 | hours | |||||||||||||||||||||||||
l90 | 13E-01 | per hour | |||||||||||||||||||||||||
Mean | 33E-02 | per hour | |||||||||||||||||||||||||
SD | 58E-02 | per hour | |||||||||||||||||||||||||
13SD | 76E-02 | per hour | |||||||||||||||||||||||||
Mean + 13SD | 11E-01 | per hour | |||||||||||||||||||||||||
Time between failures | 25 | steps | step size | bottom bin | mean | Stddev | ||||||||||||||||||||||||
Start | 1110 | device-hours | - 0 | - 0 | 000 | $000 | 536133 | - 0 | ||||||||||||||||||||||
Failure | 1510 | 962281 | 10E-06 | ERRORDIV0 | 96E+05 | ERRORDIV0 | 10E-06 | ERRORDIV0 | assume min-max is 5 std dev | |||||||||||||||||||||
Failure | 1710 | 648197 | 15E-06 | ERRORDIV0 | 65E+05 | ERRORDIV0 | 15E-06 | ERRORDIV0 | ERRORDIV0 | |||||||||||||||||||||
Failure | 11110 | 814662 | 12E-06 | ERRORDIV0 | 81E+05 | ERRORDIV0 | 12E-06 | ERRORDIV0 | Bin Label | No of Failures | Normalised | |||||||||||||||||||
Failure | 11110 | 141452 | 71E-06 | ERRORDIV0 | 14E+05 | ERRORDIV0 | 71E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 11510 | 1004475 | 10E-06 | ERRORDIV0 | 10E+06 | ERRORDIV0 | 10E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 11810 | 622443 | 16E-06 | ERRORDIV0 | 62E+05 | ERRORDIV0 | 16E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 11910 | 325476 | 31E-06 | ERRORDIV0 | 33E+05 | ERRORDIV0 | 31E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 12010 | 222575 | 45E-06 | ERRORDIV0 | 22E+05 | ERRORDIV0 | 45E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 12210 | 508397 | 20E-06 | ERRORDIV0 | 51E+05 | ERRORDIV0 | 20E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 12510 | 571250 | 18E-06 | ERRORDIV0 | 57E+05 | ERRORDIV0 | 18E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 12610 | 394182 | 25E-06 | ERRORDIV0 | 39E+05 | ERRORDIV0 | 25E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 12910 | 644171 | 16E-06 | ERRORDIV0 | 64E+05 | ERRORDIV0 | 16E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 13010 | 124490 | 80E-06 | ERRORDIV0 | 12E+05 | ERRORDIV0 | 80E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 13110 | 425465 | 24E-06 | ERRORDIV0 | 43E+05 | ERRORDIV0 | 24E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 2410 | 954421 | 10E-06 | ERRORDIV0 | 95E+05 | ERRORDIV0 | 10E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 2610 | 492722 | 20E-06 | ERRORDIV0 | 49E+05 | ERRORDIV0 | 20E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 2810 | 482169 | 21E-06 | ERRORDIV0 | 48E+05 | ERRORDIV0 | 21E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 2910 | 45598 | 22E-05 | ERRORDIV0 | 46E+04 | ERRORDIV0 | 22E-05 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 2910 | 174599 | 57E-06 | ERRORDIV0 | 17E+05 | ERRORDIV0 | 57E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 21010 | 153784 | 65E-06 | ERRORDIV0 | 15E+05 | ERRORDIV0 | 65E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 21010 | 75110 | 13E-05 | ERRORDIV0 | 75E+04 | ERRORDIV0 | 13E-05 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 21410 | 912650 | 11E-06 | ERRORDIV0 | 91E+05 | ERRORDIV0 | 11E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 21710 | 664521 | 15E-06 | ERRORDIV0 | 66E+05 | ERRORDIV0 | 15E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 21710 | 37380 | 27E-05 | ERRORDIV0 | 37E+04 | ERRORDIV0 | 27E-05 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 22010 | 759032 | 13E-06 | ERRORDIV0 | 76E+05 | ERRORDIV0 | 13E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 22210 | 534288 | 19E-06 | ERRORDIV0 | 53E+05 | ERRORDIV0 | 19E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 22510 | 615725 | 16E-06 | ERRORDIV0 | 62E+05 | ERRORDIV0 | 16E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 3110 | 949343 | 11E-06 | ERRORDIV0 | 95E+05 | ERRORDIV0 | 11E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 3510 | 927866 | 11E-06 | ERRORDIV0 | 93E+05 | ERRORDIV0 | 11E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 3610 | 225532 | 44E-06 | ERRORDIV0 | 23E+05 | ERRORDIV0 | 44E-06 | ERRORDIV0 | - 0 | 0 | ERRORNUM | |||||||||||||||||||
Failure | 3710 | 391125 | 26E-06 | ERRORDIV0 | 39E+05 | ERRORDIV0 | 26E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 3810 | 271703 | 37E-06 | ERRORDIV0 | 27E+05 | ERRORDIV0 | 37E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 31010 | 463530 | 22E-06 | ERRORDIV0 | 46E+05 | ERRORDIV0 | 22E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 31310 | 691314 | 14E-06 | ERRORDIV0 | 69E+05 | ERRORDIV0 | 14E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 31410 | 273336 | 37E-06 | ERRORDIV0 | 27E+05 | ERRORDIV0 | 37E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 31810 | 772942 | 13E-06 | ERRORDIV0 | 77E+05 | ERRORDIV0 | 13E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 32210 | 972713 | 10E-06 | ERRORDIV0 | 97E+05 | ERRORDIV0 | 10E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 32510 | 850558 | 12E-06 | ERRORDIV0 | 85E+05 | ERRORDIV0 | 12E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 32710 | 528257 | 19E-06 | ERRORDIV0 | 53E+05 | ERRORDIV0 | 19E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 4110 | 987792 | 10E-06 | ERRORDIV0 | 99E+05 | ERRORDIV0 | 10E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 4310 | 590263 | 17E-06 | ERRORDIV0 | 59E+05 | ERRORDIV0 | 17E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 4510 | 385461 | 26E-06 | ERRORDIV0 | 39E+05 | ERRORDIV0 | 26E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 4610 | 224526 | 45E-06 | ERRORDIV0 | 22E+05 | ERRORDIV0 | 45E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 4810 | 679543 | 15E-06 | ERRORDIV0 | 68E+05 | ERRORDIV0 | 15E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 41210 | 845334 | 12E-06 | ERRORDIV0 | 85E+05 | ERRORDIV0 | 12E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 41210 | 24362 | 41E-05 | ERRORDIV0 | 24E+04 | ERRORDIV0 | 41E-05 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 41510 | 621767 | 16E-06 | ERRORDIV0 | 62E+05 | ERRORDIV0 | 16E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 41810 | 788953 | 13E-06 | ERRORDIV0 | 79E+05 | ERRORDIV0 | 13E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 41910 | 366532 | 27E-06 | ERRORDIV0 | 37E+05 | ERRORDIV0 | 27E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 42310 | 780944 | 13E-06 | ERRORDIV0 | 78E+05 | ERRORDIV0 | 13E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 42410 | 251133 | 40E-06 | ERRORDIV0 | 25E+05 | ERRORDIV0 | 40E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 42710 | 786268 | 13E-06 | ERRORDIV0 | 79E+05 | ERRORDIV0 | 13E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 42810 | 207976 | 48E-06 | ERRORDIV0 | 21E+05 | ERRORDIV0 | 48E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 5210 | 985398 | 10E-06 | ERRORDIV0 | 99E+05 | ERRORDIV0 | 10E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 5510 | 698150 | 14E-06 | ERRORDIV0 | 70E+05 | ERRORDIV0 | 14E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 5910 | 874164 | 11E-06 | ERRORDIV0 | 87E+05 | ERRORDIV0 | 11E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 51110 | 499082 | 20E-06 | ERRORDIV0 | 50E+05 | ERRORDIV0 | 20E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 51110 | 101503 | 99E-06 | ERRORDIV0 | 10E+05 | ERRORDIV0 | 99E-06 | ERRORDIV0 | ||||||||||||||||||||||
Failure | 51210 | 302989 | 33E-06 | ERRORDIV0 | 30E+05 | ERRORDIV0 | 33E-06 | ERRORDIV0 | ||||||||||||||||||||||
End | 93015 | Average | 536133 | hours | 408E-06 | ERRORDIV0 | 536E+05 | ERRORDIV0 | 408E-06 | ERRORDIV0 | ||||||||||||||||||||
19E-06 | per hour | |||||||||||||||||||||||||||||
Period | 2098 | days | ||||||||||||||||||||||||||||
50352 | hours | |||||||||||||||||||||||||||||
Population | 10000 | devices | ||||||||||||||||||||||||||||
Time in service | 503520000 | Device-hours | ||||||||||||||||||||||||||||
Number of failures | 1000 | 59 | 59 | 59 | 59 | 59 | 59 | |||||||||||||||||||||||
MTBF | 503520 | hours | 57 | years | ||||||||||||||||||||||||||
l50 | 20E-06 | per hour | ||||||||||||||||||||||||||||
Confidence level | ||||||||||||||||||||||||||||||
l90 | 21E-06 | per hour | 90 | 10410605082 | ERRORVALUE | 18 | ERRORDIV0 | ERRORDIV0 | ERRORVALUE | |||||||||||||||||||||
Check mean failure rate using χsup2 | ||||||||||||||||||||||||||||||
l50 | 20E-06 | per hour | 50 | |||||||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | ||||||||||||||||||||||||||||||
MTBF | 536133 | hours | ||||||||||||||||||||||||||||
Standard deviation | 294906 | hours | ||||||||||||||||||||||||||||
13s | 383378 | hours | ||||||||||||||||||||||||||||
MTBF-13s | 152756 | hours | ||||||||||||||||||||||||||||
l90 | 65E-06 | per hour | ||||||||||||||||||||||||||||
Mean | 19E-06 | per hour | ||||||||||||||||||||||||||||
SD | 34E-06 | per hour | ||||||||||||||||||||||||||||
13SD | 44E-06 | per hour | ||||||||||||||||||||||||||||
Mean + 13SD | 63E-06 | per hour | ||||||||||||||||||||||||||||
Population | 100 | 100 | 100 | 100 | 10 | 10 | 10 | 10 | 100 | 100 | 100 | 100 | ||||||||||||||||||
Time in service hours | 50000 | 50000 | 50000 | 50000 | 1000 | 1000 | 1000 | 1000 | 10000 | 10000 | 10000 | 10000 | ||||||||||||||||||
Device-hours | 5000000 | 5000000 | 5000000 | 5000000 | 10000 | 10000 | 10000 | 10000 | 1000000 | 1000000 | 1000000 | 1000000 | ||||||||||||||||||
Number of failures | - 0 | 1 | 10 | 100 | 1 | 10 | 100 | 4 | - 0 | 1 | 10 | 100 | ||||||||||||||||||
MTBF hours | - 0 | 5000000 | 500000 | 50000 | 10000 | 1000 | 100 | 2500 | - 0 | 1000000 | 100000 | 10000 | ||||||||||||||||||
lAVG per hour | - 0 | 20E-07 | 20E-06 | 20E-05 | 10E-04 | 10E-03 | 10E-02 | 40E-04 | - 0 | 10E-06 | 10E-05 | 10E-04 | ||||||||||||||||||
05 | l50 per hour | 14E-07 | 34E-07 | 21E-06 | 20E-05 | 17E-04 | 11E-03 | 10E-02 | 47E-04 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | |||||||||||||||||
07 | l70 per hour | 24E-07 | 49E-07 | 25E-06 | 21E-05 | 24E-04 | 12E-03 | 11E-02 | 59E-04 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | |||||||||||||||||
09 | l90 per hour | 46E-07 | 78E-07 | 31E-06 | 23E-05 | 39E-04 | 15E-03 | 11E-02 | 80E-04 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | |||||||||||||||||
l90 l50 = | 33 | 23 | 14 | 11 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ERRORVALUE | ||||||||||||||||||||||
l90 l70 = | 19 | 16 | 12 | 11 | 16 | 12 | 11 | 14 | - 0 | ERRORVALUE | ERRORVALUE | ERRORVALUE | ||||||||||||||||||
48784 | 249390 | 2120378 |
Worked example (simulated maintenance records): population 1,000 devices; period 2,098 days (50,352 hours); time in service 50,352,000 device-hours; 20 failures. MTBF = 2,517,600 hours; λ50 ≈ 4.0E-07 per hour; λ90 ≈ 5.4E-07 per hour at 90% confidence (χ² check: λ50 ≈ 4.1E-07 per hour). Alternatively, using the mean and standard deviation of the times between failures (MTBF 2,449,816 hours, SD 1,106,903 hours) gives λ90 ≈ 9.9E-07 per hour.
Worked example: 100 devices in service from 1/1/10 to 9/30/15, with failures recorded on 9/30/11, 6/3/13 and 11/4/14.
Period: 2,098 days (50,352 hours)
Population: 100 devices
Time in service: 5,035,200 device-hours
Number of failures: 3
MTBF = 1,678,400 hours
λ50 ≈ 6.0E-07 per hour
λ90 ≈ 1.3E-06 per hour at 90% confidence (χ² check: λ50 ≈ 7.3E-07 per hour)
So what is confidence level?
The uncertainty in the rate of independent events depends on how many events are measured. Confidence level relates to the width of the uncertainty band, which depends on the number of failures recorded.
We can evaluate the uncertainty with the chi-squared function χ²:
α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period
Confidence level does not depend directly on population size
λα = χ²(α, ν) / 2T

λ90 compared with λ70

λ90 = χ²(0.1, 2n + 2) / 2T
Population                  100        100        100        100
Time in service (hours)     50,000     50,000     50,000     50,000
Device-hours                5,000,000  5,000,000  5,000,000  5,000,000
Number of failures          0          1          10         100
MTBF (hours)                -          5,000,000  500,000    50,000
λAVG per hour               -          2.0E-07    2.0E-06    2.0E-05
λ50 per hour                1.4E-07    3.4E-07    2.1E-06    2.0E-05
λ70 per hour                2.4E-07    4.9E-07    2.5E-06    2.1E-05
λ90 per hour                4.6E-07    7.8E-07    3.1E-06    2.3E-05
λ90 / λ50 =                 3.3        2.3        1.4        1.1
λ90 / λ70 =                 1.9        1.6        1.2        1.1
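The χ² bound is easy to evaluate numerically. The sketch below is illustrative, not from the paper: it uses the Wilson-Hilferty approximation for the χ² quantile so that only the standard library is needed, and it reproduces the 10-failure column of the table to within rounding.

```python
from math import sqrt
from statistics import NormalDist

def chi2_quantile(p: float, nu: float) -> float:
    """Approximate p-quantile of the chi-squared distribution
    (Wilson-Hilferty transformation)."""
    z = NormalDist().inv_cdf(p)
    return nu * (1 - 2 / (9 * nu) + z * sqrt(2 / (9 * nu))) ** 3

def lambda_bound(confidence: float, n_failures: int, device_hours: float) -> float:
    """Confidence bound on the failure rate: chi2(confidence, 2n + 2) / 2T."""
    nu = 2 * n_failures + 2
    return chi2_quantile(confidence, nu) / (2 * device_hours)

T = 5_000_000            # 100 devices x 50,000 hours in service
n = 10                   # recorded failures
lam_avg = n / T
lam50 = lambda_bound(0.50, n, T)
lam90 = lambda_bound(0.90, n, T)
print(f"lambda_avg = {lam_avg:.1e}/h, lambda50 = {lam50:.1e}/h, lambda90 = {lam90:.1e}/h")
# The ratio lambda90/lambda50 shrinks toward 1 as more failures are recorded
```

Repeating the calculation with n = 1 and n = 100 reproduces the widening and narrowing of the uncertainty band shown in the other columns.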
Can we be confident?
Users have collected a great deal of data over many years in a variety of environments. With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years. Surely by now we must have recorded enough failures for most commonly applied devices:

λ90 ≈ λ70 ≈ λAVG
Where can we find credible data?
An enormous effort has been made to determine λ. Some widely used sources:
• OREDA 'Offshore and Onshore Reliability Data Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data
Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.
Wide variation in reported λ
Why is the variation so wide if λ is constant?
OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence failure rate
λ is NOT constant, so χ² confidence levels do not apply. OREDA provides lower and upper deciles to show the uncertainty.
A long-standing debate
Should the calculated probability of failure take into account systematic failures?
The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.
ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.
The intent of the standards
Manage the risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods
Reduce the risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
‒ Apply diagnostics for fault detection and regular periodic testing
‒ Apply redundant equipment for fault tolerance
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match
If I record 47 heads in 100 tosses, P(head) ≈ 0.47.
If a football team wins 47 matches out of 100, what is the probability they will win their next match?
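The distinction matters because only mutually independent events with a fixed underlying probability support this kind of frequency estimate. A small simulation sketch (illustrative, not from the paper):

```python
import random

random.seed(1)

def estimate(p_win: float, n: int = 10_000) -> float:
    """Frequency of wins in n mutually independent trials with fixed probability p_win."""
    return sum(random.random() < p_win for _ in range(n)) / n

# Coin toss: independent events with a fixed underlying probability.
# The observed frequency converges on that probability as n grows.
print(estimate(0.5))   # close to 0.5

# A football team's results are NOT independent trials with a fixed
# probability: form, opposition and conditions change from match to match,
# so a past win rate of 0.47 is not the probability of winning the next match.
```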
What is random?
Dictionary: made, done or happening without method or conscious decision; haphazard.
Mathematics: a purely random process involves mutually independent events. The probability of any one event is not dependent on other events.
These definitions are significantly different.
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.
Guidance from ISO/TR 12489
Random failures:
‒ Hardware, electronic components: constant λ
‒ Hardware, mechanical components: non-constant λ (age and wear related failures in the mid-life period)
‒ Human, operating under stress, non-routine: variable λ
Systematic failures (cannot be quantified by a fixed rate):
‒ Hardware: specification, design, installation, operation
‒ Software: specification, coding, testing, modification
‒ Human: depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9, catalectic failure: sudden and complete failure.
Note 1 to entry: [...] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure — failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
This definition is not limited to 'catalectic' failure. Such failures cannot be characterised by a fixed constant failure rate, though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• so they are treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but with a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed.

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t)

For small λ·t the accumulation is approximately linear: PFD ≈ λ·t.
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is linear:

PFD ≈ λDU·t

With a proof test interval T, the average is simply:

PFDavg ≈ λDU·T/2
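The linear approximation is easy to sanity-check numerically. This sketch uses illustrative values (λDU = 0.02 per annum, a typical valve assembly; T = 1 year) and compares the exact time-average of 1 − e^(−λt) over the proof test interval with λDU·T/2, assuming a constant rate:

```python
from math import exp

lam_du = 0.02      # undetected dangerous failure rate, per year (typical valve assembly)
T = 1.0            # proof test interval, years

def pfd(t: float) -> float:
    """Exact PFD at time t since the last proof test, constant-rate assumption."""
    return 1 - exp(-lam_du * t)

# Exact time-average over the proof test interval vs the linear approximation
pfd_avg_exact = 1 - (1 - exp(-lam_du * T)) / (lam_du * T)
pfd_avg_approx = lam_du * T / 2

print(pfd_avg_exact, pfd_avg_approx)   # both close to 0.01
```

For λDU·T this small, the exact average and λDU·T/2 agree to better than 1%, which is far finer than the order-of-magnitude precision of the inputs.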
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of the final elements. With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause. Common cause failure always strongly dominates the PFD. The PFD depends equally on λDU, β and T.
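Plugging the typical values quoted later in this paper (β ≈ 10%, λDU ≈ 0.02 per annum, T ≈ 1 year) into the 1oo2 formula above shows numerically how the common cause term dominates; a minimal sketch:

```python
lam_du = 0.02   # undetected dangerous failure rate, per year (typical valve assembly)
beta   = 0.10   # common cause fraction
T      = 1.0    # proof test interval, years

# 1oo2 approximation: independent double failure term + common cause term
independent  = (1 - beta) * (lam_du * T) ** 2 / 3
common_cause = beta * lam_du * T / 2
pfd_avg = independent + common_cause

print(f"independent = {independent:.2e}, common cause = {common_cause:.2e}, "
      f"PFDavg = {pfd_avg:.2e}")
# The common cause term (~1.0e-03) dominates the independent term (~1.2e-04)
```

With these values PFDavg lands just above 1E-03, which is why the paper notes that SIL 3 needs λDU, β and/or T to be reduced.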
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude, so the scope for reduction in failures is clear: a reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order of magnitude estimates, but they are sufficient to show the feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).
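These typical rates also allow a quick whole-loop feasibility check. The sketch below is illustrative only: it assumes a loop of one sensor, one valve assembly and one relay in series (1oo1, no fault tolerance) and a one-year proof test interval, and combines the rates above into an overall λDU, PFDavg and RRF:

```python
# Typical undetected dangerous failure rates, per annum (order of magnitude)
lam_sensor = 0.003   # MTBF ~ 300 y
lam_valve  = 0.02    # MTBF ~ 50 y
lam_relay  = 0.005   # MTBF ~ 200 y

T = 1.0              # proof test interval, years

lam_total = lam_sensor + lam_valve + lam_relay   # series: any failure fails the SIF
pfd_avg = lam_total * T / 2                      # simplex (1oo1), linear approximation
rrf = 1 / pfd_avg

print(f"overall lambda_DU = {lam_total:.3f} pa, PFDavg = {pfd_avg:.3f}, RRF = {rrf:.0f}")
# RRF ~ 70: SIL 1 is feasible without fault tolerance, but SIL 2 (RRF >= 100)
# would need a lower lambda_DU, a shorter T, or redundant final elements.
```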
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12%–15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates the PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on the application
Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ Low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values (β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y):
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless the overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
The probability of SIF failure on demand is proportional to β, λDU, T and MTTR. The values assumed in design set performance standards for availability and reliability in operation and maintenance. Achieving these performance standards depends on the design, and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate the actual λDU and compare it with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Use root cause analysis to prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
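A minimal sketch of that comparison, using hypothetical maintenance records (the fleet size, hours and failure count below are invented for illustration):

```python
def observed_rate(n_failures: int, device_hours: float) -> float:
    """Mean observed failure rate from maintenance records, per hour."""
    return n_failures / device_hours

# Hypothetical records for a fleet of actuated valves
fleet_hours = 100 * 8760 * 3        # 100 valves, 3 years in service
failures_du = 9                     # dangerous undetected failures found in proof tests

lam_observed = observed_rate(failures_du, fleet_hours)   # per hour
lam_design   = 0.02 / 8760                               # design assumption: 0.02 pa

if lam_observed > lam_design:
    print(f"observed {lam_observed:.2e}/h exceeds benchmark {lam_design:.2e}/h: "
          "analyse root causes and prevent the preventable failures")
```

In a real plant this check would feed root cause analysis of every failure, not just the rate comparison.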
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring increased confidence levels, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced.
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently “as good as new” until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFDs of the individual components.
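The accumulation of undetected failures and the series combination of component PFDs can be sketched in a few lines of Python. The component names and failure rates here are illustrative assumptions, not values taken from the paper:

```python
import math

def pfd_avg(lam_du: float, test_interval: float) -> float:
    """Average probability of failure on demand for one component.

    Undetected failures accumulate as 1 - exp(-lam*t) across the proof-test
    interval; averaging over the interval gives the exact form below, which
    approaches the familiar lam*T/2 approximation when lam*T << 1.
    """
    x = lam_du * test_interval
    return 1.0 - (1.0 - math.exp(-x)) / x

# Illustrative 1oo1 loop (failure rates in failures per annum)
components = {"sensor": 0.005, "logic_solver": 0.001, "valve": 0.02}
T = 1.0  # proof-test interval in years

# For independent components in series, the component PFDs simply add
pfd_system = sum(pfd_avg(lam, T) for lam in components.values())
```

As the text notes, the final element (here the valve, with the highest λDU) contributes most of the system PFD.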
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
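The χ² estimate can be reproduced with the standard library alone, because the relevant degrees of freedom are always even (the chi-squared CDF then reduces to an Erlang sum). This is a sketch of the time-truncated form with an assumed accumulated operating time; the accumulated hours are arbitrary and cancel out of the ratio:

```python
import math

def chi2_cdf_even(x: float, dof: int) -> float:
    """CDF of a chi-squared distribution with an even number of degrees
    of freedom (an Erlang distribution), using only the stdlib."""
    m = dof // 2
    s = sum((x / 2.0) ** i / math.factorial(i) for i in range(m))
    return 1.0 - math.exp(-x / 2.0) * s

def chi2_ppf_even(p: float, dof: int) -> float:
    """Quantile of the same distribution, by bisection."""
    lo, hi = 0.0, 10.0 * dof + 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def rate_upper(failures: int, device_hours: float, confidence: float) -> float:
    """Upper confidence limit on a constant failure rate
    (chi-squared method, time-truncated observation)."""
    return chi2_ppf_even(confidence, 2 * failures + 2) / (2.0 * device_hours)

tau = 2.0e6  # assumed accumulated device-hours of observation

# With 10 failures recorded, lam90 exceeds lam70 by only ~24%;
# with 3 failures the gap widens to ~40%
ratio_10 = rate_upper(10, tau, 0.90) / rate_upper(10, tau, 0.70)
ratio_3 = rate_upper(3, tau, 0.90) / rate_upper(3, tau, 0.70)
```

Because `tau` cancels, the ratio really does depend only on the number of recorded failures, as stated above.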
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard.'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as “white noise”, relating to the energy characteristics of certain physical phenomena.'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and so are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process; degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
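The contrast between a truly constant rate and a wear-driven rate can be illustrated with a Weibull hazard function. The shape and scale values below are arbitrary assumptions chosen for illustration, not data from the references:

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape-1).

    shape == 1 gives the memoryless exponential case (constant rate, the
    'catalectic' idealisation); shape > 1 models wear-out, where the rate
    climbs with age unless the item is overhauled or renewed.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)

# Arbitrary wear-out example: shape 2.5, characteristic life 20 years
rate_young = weibull_hazard(2.0, 2.5, 20.0)   # rate at 2 years of age
rate_old = weibull_hazard(15.0, 2.5, 20.0)    # rate at 15 years of age
rate_const = weibull_hazard(15.0, 1.0, 20.0)  # constant 0.05 p.a. reference
```

Renewing the item effectively resets `t`, which is why a population under effective maintenance can exhibit a roughly constant quasi-random rate even though each individual item wears out.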
The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1-out-of-2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β) (λDU T)² / 3 + β λDU T / 2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU T

If the test interval T is 1 year and λDU = 0.02 p.a., then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
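Plugging the example numbers into the approximation above shows how quickly the β term takes over. This is a sketch; the example β values are assumptions for illustration:

```python
def pfd_1oo2_terms(lam_du: float, T: float, beta: float):
    """The two terms of the 1oo2 approximation used in the text:
    independent dangerous undetected failures, and common cause failures."""
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    return independent, common_cause

lam_du, T = 0.02, 1.0  # example values from the text (per annum, years)

ind_10, ccf_10 = pfd_1oo2_terms(lam_du, T, 0.10)  # typical beta of ~10%
ind_1, ccf_1 = pfd_1oo2_terms(lam_du, T, 0.01)    # optimistic beta of 1%

# At beta = 10% the common cause term dominates the pair's PFD outright
dominated = ccf_10 > 5 * ind_10
```

With these numbers the two terms only break even near β ≈ 1.3%, consistent with the "β < 2%" observation above.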
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
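The detected-failure contribution is simply the fraction of time spent awaiting repair. A one-line sketch with assumed numbers:

```python
HOURS_PER_YEAR = 8760.0

def pfd_detected(lam_dd_per_year: float, mttr_hours: float) -> float:
    """Unavailability due to detected failures: the detected failure rate
    multiplied by the mean time to restoration, in consistent units."""
    return lam_dd_per_year * (mttr_hours / HOURS_PER_YEAR)

# Assumed values: 0.1 detected failures p.a., 24 h mean time to restoration
pfd_dd = pfd_detected(0.1, 24.0)  # on the order of 3e-4
```

This is why designing for accessibility, which shortens the MTTR, reduces the PFD directly.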
Wrong but useful
The SINTEF Report A26922 includes a pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote appears in the context of a discussion of the different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.

The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
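Categorising an order-of-magnitude PFD estimate into a SIL band is straightforward; the low-demand bands are those of IEC 61508-1, and the helper itself is just an illustration:

```python
import math

def sil_band(pfd_avg: float) -> int:
    """Map a low-demand average PFD to its SIL band per IEC 61508-1:
    SIL n covers 10**-(n+1) <= PFD < 10**-n, for SIL 1 to SIL 4."""
    if not (1e-5 <= pfd_avg < 1e-1):
        raise ValueError("PFD outside the SIL 1..4 low-demand bands")
    return -math.floor(math.log10(pfd_avg)) - 1

band = sil_band(0.005)  # RRF of 200: within the SIL 2 range
```

An order-of-magnitude estimate is exactly enough input for this categorisation, which is the point made above.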
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 p.a. quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 p.a.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. with 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
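One simple way to compare recorded failures against the design benchmark is a Poisson check. This sketch uses assumed fleet numbers; SINTEF A8788 describes its own follow-up procedure, which this does not reproduce:

```python
import math

def prob_at_least(k: int, mu: float) -> float:
    """P(N >= k) for N ~ Poisson(mu): the chance of recording k or more
    failures if the benchmark failure rate were actually being met."""
    cdf = sum(math.exp(-mu) * mu ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

# Assumed fleet: 50 valves, benchmark lam_DU = 0.02 p.a., 2 years in service
expected = 50 * 0.02 * 2  # 2.0 failures expected at the benchmark rate
observed = 7

p = prob_at_least(observed, expected)
# p is well below 1%: strong evidence that the benchmark rate is being
# exceeded, so the causes of the failures need to be analysed
```

A result like this triggers exactly the response described above: analyse the causes and consider compensating measures, not merely more frequent testing.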
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates for some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than that given to SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and they must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable, if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources is limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1 The first priority in functional safety is to eliminate prevent avoid or control systematic failures throughout the entire system lifecycle Failures are minimised by applying conventional quality management and project management practices This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function It includes the achievement of an appropriate level of systematic capability
The level of attention to detail and the effectiveness of the processes techniques and measures must be in proportion to the target level of risk reduction SIL 3 functions need much stricter quality control than SIL 1 functions
2 The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics accessibility inspection testing maintenance and renewal
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction The requirements also depend on the cost of downtime To achieve SIL 3 safety functions will always need to be designed to enable ready access for inspection testing and maintenance
The planning for inspection and testing should be in proportion to the target level of risk reduction Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment
3 Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions Common cause failures dominate the PFD in all voted architectures of sensors and of final elements
4 In the operations phase measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design inspection testing maintenance and renewal The measured failure rates should be compared with the rates assumed in the PFD calculations
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.
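As a sketch of how a measured rate can be compared with a target benchmark, the fragment below estimates the mean failure rate from field data together with an upper 90 % confidence bound, λ90 = χ²(0.90, 2k+2) / (2T), the approach described in IEC 60605-6 for time-truncated tests. The χ² quantile is approximated with the Wilson–Hilferty formula so only the standard library is needed; the population, time in service and failure count are assumed example values.

```python
# Illustrative sketch: mean failure rate and an upper confidence bound from
# observed failures, lambda_U = chi2(conf, 2k + 2) / (2 T), per IEC 60605-6.
# The chi-square quantile uses the Wilson-Hilferty approximation.
# Population, hours and failure count are assumed example values.
from statistics import NormalDist

def chi2_quantile(p: float, dof: float) -> float:
    """Wilson-Hilferty approximation to the chi-square quantile."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + z * a ** 0.5) ** 3

def failure_rate_bound(failures: int, device_hours: float, conf: float) -> float:
    """Upper confidence bound on the failure rate (time-truncated test)."""
    return chi2_quantile(conf, 2 * failures + 2) / (2.0 * device_hours)

T = 100 * 50_000.0       # assumed: 100 devices, 50 000 h each in service
k = 10                   # assumed: 10 failures observed
lam_mean = k / T
lam_90 = failure_rate_bound(k, T, 0.90)
print(f"mean rate       : {lam_mean:.1e} per hour")
print(f"90% upper bound : {lam_90:.1e} per hour")
```

With few failures the 90 % bound sits well above the mean rate (here about 1.5× higher); as more device-hours and failures accumulate, the bound tightens towards the mean, which is why sustained measurement is needed before a benchmark comparison becomes meaningful.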
Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F. H. C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
[Appendix worksheets: spreadsheet calculations illustrating the estimation of the mean failure rate, MTBF and the 50 %, 70 % and 90 % confidence bounds (λ50, λ70, λ90) from simulated failure data and times between failures, using the χ² distribution and, alternatively, the mean and standard deviation of the observed rates. The raw worksheet cells are not reproducible in text form.]
Failure | 41514 | 2461238 | 41E-07 | 63911536882 | -63911536882 | 6 | 584 | 0 | 03294237965 | |||||||||||||||||||
Failure | 91814 | 3746271 | 27E-07 | 65735991807 | -65735991807 | 6 | 590 | 0 | 04662211211 | |||||||||||||||||||
Failure | 12514 | 1875103 | 53E-07 | 6273025093 | -6273025093 | 6 | 596 | 1 | 06297505128 | |||||||||||||||||||
Failure | 5415 | 3597103 | 28E-07 | 65559528244 | -65559528244 | 6 | 602 | 0 | 08118666926 | |||||||||||||||||||
Failure | 62915 | 1342319 | 74E-07 | 6127855577 | -6127855577 | 6 | 608 | 2 | 0998942587 | |||||||||||||||||||
Failure | 8415 | 858503 | 12E-06 | 59337415921 | -59337415921 | 6 | 614 | 1 | 11731024489 | |||||||||||||||||||
End | 93015 | Average | 2449816 | hours | 723E-07 | avg | 63 | - 63 | 6 | 620 | 0 | 13148341142 | ||||||||||||||||
6 | 626 | 1 | 14065189732 | |||||||||||||||||||||||||
41E-07 | per hour | sd | 031368536 | 031368536 | 6 | 632 | 2 | 14360178406 | ||||||||||||||||||||
avg+13SD | 67 | -67244861308 | 6 | 638 | 2 | 13993091861 | ||||||||||||||||||||||
Period | 2098 | days | 6 | 644 | 3 | 13013890374 | ||||||||||||||||||||||
50352 | hours | MTBF90 | 5302567 | 5302567 | 7 | 650 | 2 | 11551548664 | ||||||||||||||||||||
Lambda90 | 189E-07 | 189E-07 | 7 | 656 | 4 | 09786173006 | ||||||||||||||||||||||
Population | 1000 | devices | 7 | 662 | 0 | 07912708659 | ||||||||||||||||||||||
Time in service | 50352000 | Device-hours | 189E-07 | 00000001886 | 7 | 668 | 1 | 06106285011 | ||||||||||||||||||||
Number of failures | 20 | 7 | 674 | 0 | 04497473114 | |||||||||||||||||||||||
MTBF | 2517600 | hours | ||||||||||||||||||||||||||
l50 | 40E-07 | per hour | ||||||||||||||||||||||||||
-67389046286 | ||||||||||||||||||||||||||||
-6718562883 | ||||||||||||||||||||||||||||
-67704516495 | ||||||||||||||||||||||||||||
-69038810225 | ||||||||||||||||||||||||||||
-68746281123 | ||||||||||||||||||||||||||||
Confidence level | -67719319389 | |||||||||||||||||||||||||||
l90 | 54E-07 | per hour | 90 | -62699281143 | -66371588608 | |||||||||||||||||||||||
Check mean failure rate using χsup2 | -66979903128 | |||||||||||||||||||||||||||
l50 | 41E-07 | per hour | 50 | 005 | ||||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | ||||||||||||||||||||||||||||
MTBF | 2449816 | hours | ||||||||||||||||||||||||||
Standard deviation | 1106903 | hours | ||||||||||||||||||||||||||
13s | 1438973 | hours | ||||||||||||||||||||||||||
MTBF-13s | 1010843 | hours | ||||||||||||||||||||||||||
l90 | 99E-07 | per hour | ||||||||||||||||||||||||||
Mean | 41E-07 | per hour | ||||||||||||||||||||||||||
SD | 90E-07 | per hour | ||||||||||||||||||||||||||
13SD | 12E-06 | per hour | ||||||||||||||||||||||||||
Mean + 13SD | 16E-06 | per hour | ||||||||||||||||||||||||||
Start | 1110 | |||||||
Failure | 93011 | |||||||
Failure | 6313 | |||||||
Failure | 11414 | |||||||
End | 93015 | |||||||
Period | 2098 | days | ||||||
50352 | hours | |||||||
Population | 100 | devices | ||||||
Time in service | 5035200 | Device-hours | ||||||
Number of failures | 3 | |||||||
MTBF | 1678400 | hours | ||||||
l50 | 60E-07 | per hour | ||||||
Confidence level | ||||||||
l90 | 13E-06 | per hour | 90 | |||||
Check mean failure rate using χsup2 | ||||||||
l50 | 73E-07 | per hour | 50 |
λ90 compared with λ70:

λa = χ²a(2n + 2) / (2T)

where λa is the failure rate estimated at confidence level a, n is the number of recorded failures and T is the accumulated time in service (device-hours).
Population                 100        100        100        100
Time in service (hours)    50,000     50,000     50,000     50,000
Device-hours               5,000,000  5,000,000  5,000,000  5,000,000
Number of failures         0          1          10         100
MTBF (hours)               -          5,000,000  500,000    50,000
λAVG (per hour)            -          2.0E-07    2.0E-06    2.0E-05
λ50 (per hour)             1.4E-07    3.4E-07    2.1E-06    2.0E-05
λ70 (per hour)             2.4E-07    4.9E-07    2.5E-06    2.1E-05
λ90 (per hour)             4.6E-07    7.8E-07    3.1E-06    2.3E-05
λ90 / λ50                  3.3        2.3        1.4        1.1
λ90 / λ70                  1.9        1.6        1.2        1.1
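A sketch of the calculation (standard library only; the χ² quantile is approximated with the Wilson-Hilferty formula, which is accurate to well under 1% at these degrees of freedom) reproduces the 10-failure column of the table above:

```python
from statistics import NormalDist

def chi2_quantile(p, k):
    """Wilson-Hilferty approximation to the chi-squared p-quantile (k degrees of freedom)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * k)
    return k * (1.0 - c + z * c ** 0.5) ** 3

def lambda_conf(p, n_failures, device_hours):
    """Failure rate at confidence level p: chi-squared_p(2n + 2) / (2T)."""
    return chi2_quantile(p, 2 * n_failures + 2) / (2.0 * device_hours)

T = 5_000_000  # device-hours (100 devices x 50,000 hours)
for p in (0.50, 0.70, 0.90):
    print(f"λ{int(round(p * 100))} = {lambda_conf(p, 10, T):.1e} per hour")
# λ50 = 2.1e-06, λ70 = 2.5e-06, λ90 = 3.1e-06 per hour
```

The same two helper functions reproduce the other columns, and the worked maintenance-log examples above, to the two significant figures shown.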
Can we be confident?
Users have collected so much data, over many years, in a variety of environments.
With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years.
Surely by now we must have recorded enough failures for most commonly applied devices:
λ90 ≈ λ70 ≈ λAVG
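The arithmetic behind that claim is straightforward (a minimal sketch using the figures quoted above):

```python
mtbf_years = 100               # assumed MTBF of 100 years
failure_rate = 1 / mtbf_years  # 0.01 failures per device-year

devices = 100
years_in_service = 10
device_years = devices * years_in_service  # 1000 device-years of experience

expected_failures = failure_rate * device_years
print(expected_failures)  # 10.0
```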
Where can we find credible data?
An enormous effort has been made to determine λ. Some widely used sources:
• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into the exSILentia software and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data
Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide: uncertainty intervals span one or two orders of magnitude.

Wide variation in reported λ
Why is the variation so wide, if λ is constant? OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence the failure rate
λ is NOT constant, so χ² confidence levels do not apply. OREDA instead provides lower and upper deciles to show the uncertainty.
A long-standing debate
Should the calculated probability of failure take into account systematic failures?
The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.
ISA-TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.
The intent of the standards
Manage the risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods
Reduce the risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
‒ Apply diagnostics for fault detection and regular periodic testing
‒ Apply redundant equipment for fault tolerance
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match
If I record 47 heads in 100 tosses, P(head) ≈ 0.47.
If a football team wins 47 matches out of 100, what is the probability they will win their next match?
What is random?
Dictionary: 'Made, done, or happening without method or conscious decision; haphazard'
Mathematics: A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.
These definitions are significantly different.
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Random failures:
• Hardware, electronic components: constant λ
• Hardware, mechanical components: non-constant λ (age- and wear-related failures in the mid-life period)
• Human, operating under stress, non-routine: variable λ
Systematic failures, which cannot be quantified by a fixed rate:
• Hardware: specification, design, installation, operation
• Software: specification, coding, testing, modification
• Human: depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9: catalectic failure - sudden and complete failure.
Note 1 to entry: [...] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure - failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
This definition is not limited to 'catalectic' failure. Such failures cannot be characterised by a fixed, constant failure rate, though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age- or wear-related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but with wide variation between different operators
• not a fixed, constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed:

PFD(t) = ∫[0,t] λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t)
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially. SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU·t

With a proof test interval T the average is simply:

PFDavg ≈ λDU·T / 2
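A quick numerical sketch (with an illustrative λDU of 0.02 per year and a one-year proof test interval, consistent with the typical values quoted elsewhere in this document) shows how close the linear approximation is in this region:

```python
import math

lam = 0.02  # dangerous undetected failure rate, per year (illustrative)
T = 1.0     # proof test interval, years

# Exact time-average of PFD(t) = 1 - exp(-lam*t) over one test interval
pfd_avg_exact = 1 - (1 - math.exp(-lam * T)) / (lam * T)

# Linear approximation PFDavg ≈ lam*T/2
pfd_avg_linear = lam * T / 2

print(f"exact  : {pfd_avg_exact:.6f}")   # 0.009934
print(f"linear : {pfd_avg_linear:.6f}")  # 0.010000
```

The two agree to better than 1% here; the approximation only breaks down as λDU·T approaches 0.1 and beyond.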
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of the final elements. With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)² / 3 + β·λDU·T / 2

where β is the fraction of failures that share a common cause. Common cause failure always strongly dominates the PFD. The PFD depends equally on λDU, β and T.
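The relative weight of the two terms can be compared numerically (illustrative values: β = 10%, λDU = 0.02 per year, T = 1 year):

```python
beta = 0.10  # common cause fraction (illustrative)
lam = 0.02   # dangerous undetected failure rate, per year (illustrative)
T = 1.0      # proof test interval, years

# Independent-failure term and common cause term of the 1oo2 formula
independent_term = (1 - beta) * (lam * T) ** 2 / 3
common_cause_term = beta * lam * T / 2
pfd_avg = independent_term + common_cause_term

print(f"independent  : {independent_term:.2e}")   # 1.20e-04
print(f"common cause : {common_cause_term:.2e}")  # 1.00e-03
print(f"PFDavg       : {pfd_avg:.2e}")            # 1.12e-03
```

With these values the common cause term is roughly ten times larger than the independent term, which is the point made above: adding a second valve buys little unless β is also controlled.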
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed, constant rate
Typical failure rates are easy to obtain, and failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude, so the scope for reduction in failures is clear: a reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum.
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum.
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum.
These are order-of-magnitude estimates, but they are sufficient to show the feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12%-15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates the PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on the application
Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 p.a., T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless the overall λDU can be reduced to < 0.02 p.a. (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 p.a.:
• either λDU, β and/or T must be minimised
• needs close attention in design
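Those feasibility claims can be checked with a short sketch (linear 1oo1 approximation and the 1oo2 formula quoted earlier; the SIL bands are the standard demand-mode PFDavg decades; all input values are the illustrative typical figures above):

```python
def pfd_1oo1(lam, T):
    """Average PFD of a single (1oo1) element, linear approximation."""
    return lam * T / 2

def pfd_1oo2(lam, T, beta):
    """Average PFD of a 1oo2 pair, including the common cause fraction beta."""
    return (1 - beta) * (lam * T) ** 2 / 3 + beta * lam * T / 2

def sil_band(pfd):
    """SIL achievable for a given PFDavg (demand-mode decade bands)."""
    for sil, upper in ((4, 1e-4), (3, 1e-3), (2, 1e-2), (1, 1e-1)):
        if pfd < upper:
            return sil
    return 0

lam, T, beta = 0.02, 1.0, 0.10  # typical values from the text

print(sil_band(pfd_1oo1(lam, T)))        # 1oo1: PFDavg = 0.01, SIL 1 only
print(sil_band(pfd_1oo2(lam, T, beta)))  # 1oo2: PFDavg ≈ 1.1e-3, SIL 2
```

A single final element with these values sits right at the SIL 1/SIL 2 boundary, so SIL 2 needs either 1oo2 voting or a reduced λDU, exactly as stated above.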
Design assumptions set performance benchmarks
The probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance. Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate the actual λDU and compare it with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that
‒ systematic capability is adequate
‒ inspection and maintenance are effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, the operating environment and the maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3. Using risk-based inspection and condition-based maintenance techniques to:
   - Identify and then control conditions that induce early failures
   - Actively prevent common-cause failures
4. Detailed analysis of all failures to:
   - Monitor failure rates
   - Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components
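As a simple sketch of that combination (illustrative PFDavg values, not taken from any particular dataset): for components in series, where every component must work, the system PFD is approximately the sum of the individual PFDs when each is small:

```python
# Illustrative component PFDavg values for a simple 1oo1 SIF
# (sensor, logic solver, final element) - assumed figures for this sketch.
pfd_components = {
    "sensor": 1.5e-3,
    "logic_solver": 1.0e-4,
    "final_element": 1.0e-2,
}

# For small PFDs in series the rare-event approximation applies:
# P(any fails) ≈ sum of the individual probabilities.
pfd_system = sum(pfd_components.values())
print(f"{pfd_system:.2e}")  # 1.16e-02, dominated by the final element
```

With these illustrative values the final element contributes over 80% of the system PFD, which foreshadows the point made later about final elements dominating.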
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
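That claim can be checked numerically. The sketch below computes λ90/λ70 = χ²0.90(2n + 2) / χ²0.70(2n + 2), in which the 2T term cancels, using the Wilson-Hilferty approximation for the χ² quantile (standard library only):

```python
from statistics import NormalDist

def chi2_quantile(p, k):
    """Wilson-Hilferty approximation to the chi-squared p-quantile (k degrees of freedom)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * k)
    return k * (1.0 - c + z * c ** 0.5) ** 3

def ratio_90_70(n_failures):
    """Ratio of the 90% to the 70% confidence failure rate estimate for n failures."""
    k = 2 * n_failures + 2
    return chi2_quantile(0.90, k) / chi2_quantile(0.70, k)

for n in (1, 3, 10, 100):
    print(f"n = {n:3d}: λ90/λ70 ≈ {ratio_90_70(n):.2f}")
```

For n = 10 the ratio comes out near 1.24, and it keeps falling toward 1 as more failures are recorded, consistent with the 'about 20% higher' figure above and with the table in the appended slides.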
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random, and very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process: degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures; inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany
The reasons for the wide variability in reported failure rates seem to be clear:
bull Only a small proportion of failures occur at a fixed rate that cannot be changed.
bull The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
bull The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
bull The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices.
bull No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic, and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1-out-of-2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β) (λDU T)² / 3 + β λDU T / 2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless:

β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
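The dominance of the common cause term can be checked numerically. Below is a minimal sketch (the author's illustration, using the example values quoted in the text) that evaluates the two terms of the 1oo2 approximation side by side:

```python
def pfd_1oo2(lam_du, T, beta):
    """Average PFD of a 1oo2 final-element subsystem.

    lam_du -- undetected dangerous failure rate (per annum)
    T      -- proof-test interval (years)
    beta   -- common cause fraction
    """
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    return independent + common_cause

lam_du, T = 0.02, 1.0  # example values from the text
for beta in (0.013, 0.05, 0.10):
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    print(f"beta={beta:.1%}: independent={independent:.1e}, "
          f"common cause={common_cause:.1e}")
```

With β = 10% the common cause term is roughly an order of magnitude larger than the independent term; for these values the two terms only balance when β drops to about 1.3%.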
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
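These proportionalities can be written down directly. A minimal sketch, where the λDD and MTTR values are illustrative assumptions and not taken from the text:

```python
HOURS_PER_YEAR = 8760.0

def pfd_contributions(lam_du, T_years, beta, lam_dd, mttr_hours):
    """Dominant PFD contributions for a redundant final-element subsystem:
    the common cause (undetected) term beta * lam_DU * T / 2, and the
    detected term lam_DD * MTTR (MTTR converted from hours to years)."""
    undetected = beta * lam_du * T_years / 2
    detected = lam_dd * (mttr_hours / HOURS_PER_YEAR)
    return undetected, detected

# Illustrative values: beta = 10%, lam_DU = 0.02 pa, T = 1 year,
# lam_DD = 0.18 pa and MTTR = 8 hours (the last two are assumed)
u, d = pfd_contributions(0.02, 1.0, 0.10, 0.18, 8.0)
print(f"undetected: {u:.1e}, detected: {d:.1e}")
```

For a short MTTR the detected-failure contribution is small; it grows in direct proportion to the time the function is out of service.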
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful.'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant; the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice.
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
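The order-of-magnitude categorisation described above can be expressed as a simple band lookup. This sketch is the author's illustration, using the IEC 61511 low-demand PFD bands:

```python
import math

def sil_band(pfd_avg):
    """Return the SIL band for a low-demand PFDavg (IEC 61511 bands:
    SIL 1 covers 1e-2 <= PFD < 1e-1, down to SIL 4 covering
    1e-5 <= PFD < 1e-4), or None outside SIL 1-4."""
    if not 1e-5 <= pfd_avg < 1e-1:
        return None
    return -math.floor(math.log10(pfd_avg)) - 1

def rrf_one_sig_fig(pfd_avg):
    """Risk reduction factor reported with one significant figure only."""
    return float(f"{1.0 / pfd_avg:.0e}")

print(sil_band(1.1e-3), rrf_one_sig_fig(1.1e-3))  # 2 900.0
print(sil_band(9.0e-4), rrf_one_sig_fig(9.0e-4))  # 3 1000.0
```

Two PFD estimates that differ by less than a factor of two can still fall into different SIL bands, which is exactly why only the order of magnitude is meaningful.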
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
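For a single final element the average PFD is approximately λDU·T/2, so the marginality of SIL 2 with 1oo1 can be sketched directly (an illustration using the feasible rates quoted in the text):

```python
def pfd_1oo1(lam_du, T):
    """Average PFD of a single (1oo1) final element: lam_DU * T / 2."""
    return lam_du * T / 2

# Valve, annual proof test: right on the SIL 1 / SIL 2 boundary
print(pfd_1oo1(0.02, 1.0))   # 0.01 -> marginal for SIL 2
# Contactor, annual proof test: comfortably inside the SIL 2 band
print(pfd_1oo1(0.01, 1.0))   # 0.005
# Valve with a six-monthly proof test: halving T halves the PFD
print(pfd_1oo1(0.02, 0.5))   # 0.005
```

Either a lower λDU (contactor) or a shorter proof-test interval moves the 1oo1 PFD clear of the band boundary; if neither is practicable, the 1oo2 architecture is the remaining option.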
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
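The target-setting idea can be sketched as a simple count comparison. The Poisson significance criterion below is the author's illustration, not taken from the SINTEF guideline:

```python
import math

def expected_failures(lam_target, n_devices, years):
    """Target number of failures implied by the design failure rate."""
    return lam_target * n_devices * years

def exceeds_target(observed, expected, alpha=0.05):
    """Flag when seeing `observed` or more failures would be unlikely
    (probability < alpha) if the true mean equalled the target value,
    treating the failure count as Poisson distributed."""
    p_below = sum(math.exp(-expected) * expected ** k / math.factorial(k)
                  for k in range(observed))
    return (1.0 - p_below) < alpha

mu = expected_failures(0.02, 50, 2.0)   # 50 valves over 2 years -> 2.0
print(mu, exceeds_target(6, mu), exceeds_target(3, mu))  # 2.0 True False
```

Six failures against a target of two warrants root cause analysis; three failures is still within the expected scatter for so small a sample, which is why trends and causes matter more than raw counts.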
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
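When no failures at all have been observed, a rate cannot be measured, but an upper confidence bound can still be stated. Below is a sketch of the standard zero-failure (χ²-derived) bound, with an assumed example population:

```python
import math

def zero_failure_upper_bound(device_hours, confidence=0.90):
    """Upper confidence bound on the failure rate when no failures have
    been observed over `device_hours` of accumulated service time.
    From the chi-squared relation lam_UL = chi2(CL; 2) / (2T), which for
    zero failures reduces to -ln(1 - CL) / T."""
    return -math.log(1.0 - confidence) / device_hours

# Assumed example: 1000 devices in service for one year, zero failures
T = 1000 * 8760.0
lam_90 = zero_failure_upper_bound(T)
print(f"lambda_90 <= {lam_90:.1e} per hour ({lam_90 * 8760:.4f} pa)")
```

Even with a thousand device-years and no failures, the 90% bound is only around 0.002 pa, which illustrates how much operating experience is needed to substantiate a low claimed rate.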
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied, with sufficient effectiveness, to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing lsquorandomrsquo failures
This same approach of active prevention should be extended to the management of random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly, at fixed constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude; the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function, and it includes the achievement of an appropriate level of systematic capability.
The level of attention to detail, and the effectiveness of the processes, techniques and measures, must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016.
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007.
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010.
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010.
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489.
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015.
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013.
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013.
[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008.
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015.
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015.
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015.
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990.
00204081633 | 49E+01 | ||||||||||||||||||||||||
002 | 50E+01 | ||||||||||||||||||||||||
00196078431 | 51E+01 | ||||||||||||||||||||||||
00192307692 | 52E+01 | ||||||||||||||||||||||||
00188679245 | 53E+01 | ||||||||||||||||||||||||
00185185185 | 54E+01 | ||||||||||||||||||||||||
00181818182 | 55E+01 | ||||||||||||||||||||||||
00178571429 | 56E+01 | ||||||||||||||||||||||||
00175438596 | 57E+01 | ||||||||||||||||||||||||
00172413793 | 58E+01 | ||||||||||||||||||||||||
00169491525 | 59E+01 | ||||||||||||||||||||||||
Average | 007904 | hours | 300E+01 | ||||||||||||||||||||||
33E-02 | |||||||||||||||||||||||||
Period | - 0 | days | |||||||||||||||||||||||
- 0 | hours | ||||||||||||||||||||||||
Population | 1000 | devices | |||||||||||||||||||||||
Time in service | - 0 | Device-hours | |||||||||||||||||||||||
Number of failures | - 0 | ||||||||||||||||||||||||
MTBF | ERRORDIV0 | hours | ERRORDIV0 | years | |||||||||||||||||||||
l50 | ERRORDIV0 | per hour | |||||||||||||||||||||||
Confidence level | |||||||||||||||||||||||||
l90 | ERRORDIV0 | per hour | 90 | ||||||||||||||||||||||
Check mean failure rate using χsup2 | |||||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | 50 | ||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | |||||||||||||||||||||||||
MTBF | 0 | hours | |||||||||||||||||||||||
Standard deviation | 0 | hours | |||||||||||||||||||||||
13s | 0 | hours | |||||||||||||||||||||||
MTBF-13s | - 0 | hours | |||||||||||||||||||||||
l90 | -89E+00 | per hour | |||||||||||||||||||||||
Mean | 13E+01 | per hour | |||||||||||||||||||||||
SD | 68E+00 | per hour | |||||||||||||||||||||||
13SD | 88E+00 | per hour | |||||||||||||||||||||||
Mean + 13SD | 21E+01 | per hour | |||||||||||||||||||||||
Min | Max | 30 | steps | step size | bottom bin | mean | Stddev | ||||||||||||||||||||
Number | inverse | - 0 | 6000 | 6000 | 200 | 200 | $000 | 30 | 12 | ||||||||||||||||||
1 | 10E+00 | assume min-max is 5 std dev | |||||||||||||||||||||||||
2 | 50E-01 | $000 | |||||||||||||||||||||||||
3 | 33E-01 | Bin Bottom | Bin Label | No of Failures | Normalised | ||||||||||||||||||||||
4 | 25E-01 | - 0 | - 0 | 0 | 00014606917 | ||||||||||||||||||||||
5 | 20E-01 | 2 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
6 | 17E-01 | 4 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
7 | 14E-01 | 6 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
8 | 13E-01 | 8 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
9 | 11E-01 | 10 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
10 | 10E-01 | 12 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
11 | 91E-02 | 14 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
12 | 83E-02 | 16 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
13 | 77E-02 | 18 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
14 | 71E-02 | 20 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
15 | 67E-02 | 22 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
16 | 63E-02 | 24 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
17 | 59E-02 | 26 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
18 | 56E-02 | 28 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
19 | 53E-02 | 30 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
20 | 50E-02 | 32 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
21 | 48E-02 | 34 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
22 | 45E-02 | 36 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
23 | 43E-02 | 38 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
24 | 42E-02 | 40 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
25 | 40E-02 | 42 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
26 | 38E-02 | 44 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
27 | 37E-02 | 46 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
28 | 36E-02 | 48 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
29 | 34E-02 | 50 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
30 | 33E-02 | 52 | - 0 | 2 | 00014606917 | ||||||||||||||||||||||
31 | 32E-02 | ||||||||||||||||||||||||||
32 | 31E-02 | ||||||||||||||||||||||||||
33 | 30E-02 | ||||||||||||||||||||||||||
34 | 29E-02 | ||||||||||||||||||||||||||
35 | 29E-02 | ||||||||||||||||||||||||||
36 | 28E-02 | ||||||||||||||||||||||||||
37 | 27E-02 | ||||||||||||||||||||||||||
38 | 26E-02 | ||||||||||||||||||||||||||
39 | 26E-02 | ||||||||||||||||||||||||||
40 | 25E-02 | ||||||||||||||||||||||||||
41 | 24E-02 | ||||||||||||||||||||||||||
42 | 24E-02 | ||||||||||||||||||||||||||
43 | 23E-02 | ||||||||||||||||||||||||||
44 | 23E-02 | ||||||||||||||||||||||||||
45 | 22E-02 | ||||||||||||||||||||||||||
46 | 22E-02 | ||||||||||||||||||||||||||
47 | 21E-02 | ||||||||||||||||||||||||||
48 | 21E-02 | ||||||||||||||||||||||||||
49 | 20E-02 | ||||||||||||||||||||||||||
50 | 20E-02 | ||||||||||||||||||||||||||
51 | 20E-02 | ||||||||||||||||||||||||||
52 | 19E-02 | ||||||||||||||||||||||||||
53 | 19E-02 | ||||||||||||||||||||||||||
54 | 19E-02 | ||||||||||||||||||||||||||
55 | 18E-02 | ||||||||||||||||||||||||||
56 | 18E-02 | ||||||||||||||||||||||||||
57 | 18E-02 | ||||||||||||||||||||||||||
58 | 17E-02 | ||||||||||||||||||||||||||
59 | 17E-02 | ||||||||||||||||||||||||||
Average | 30 | hours | 790E-02 | ||||||||||||||||||||||||
33E-02 | |||||||||||||||||||||||||||
Period | - 0 | days | |||||||||||||||||||||||||
- 0 | hours | ||||||||||||||||||||||||||
Population | 1000 | devices | |||||||||||||||||||||||||
Time in service | - 0 | Device-hours | |||||||||||||||||||||||||
Number of failures | - 0 | ||||||||||||||||||||||||||
MTBF | ERRORDIV0 | hours | ERRORDIV0 | years | |||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | |||||||||||||||||||||||||
Confidence level | |||||||||||||||||||||||||||
l90 | ERRORDIV0 | per hour | 90 | ||||||||||||||||||||||||
Check mean failure rate using χsup2 | |||||||||||||||||||||||||||
l50 | ERRORDIV0 | per hour | 50 | ||||||||||||||||||||||||
Alternatively using mean and standard deviation to derive l90 | |||||||||||||||||||||||||||
MTBF | 30 | hours | |||||||||||||||||||||||||
Standard deviation | 17 | hours | |||||||||||||||||||||||||
13s | 22 | hours | |||||||||||||||||||||||||
MTBF-13s | 8 | hours | |||||||||||||||||||||||||
l90 | 13E-01 | per hour | |||||||||||||||||||||||||
Mean | 33E-02 | per hour | |||||||||||||||||||||||||
SD | 58E-02 | per hour | |||||||||||||||||||||||||
13SD | 76E-02 | per hour | |||||||||||||||||||||||||
Mean + 13SD | 11E-01 | per hour | |||||||||||||||||||||||||
[Supporting worksheet: failure log for 10,000 devices over 2,098 days (503,520,000 device-hours) with 1,000 failures. MTBF = 503,520 hours (57 years); λ50 ≈ 2.0E-06 per hour; λ90 (90% confidence, χ²) ≈ 2.1E-06 per hour. With this many failures the 90% bound is barely above the mean.]
χ² confidence bounds for failure rate estimates; the ratio λ90/λ50 shrinks as more failures are observed:

Population | 100 | 100 | 100 | 100 | 10 | 10 | 10 | 10
Time in service (hours) | 50,000 | 50,000 | 50,000 | 50,000 | 1,000 | 1,000 | 1,000 | 1,000
Device-hours | 5,000,000 | 5,000,000 | 5,000,000 | 5,000,000 | 10,000 | 10,000 | 10,000 | 10,000
Number of failures | 0 | 1 | 10 | 100 | 1 | 10 | 100 | 4
MTBF (hours) | – | 5,000,000 | 500,000 | 50,000 | 10,000 | 1,000 | 100 | 2,500
λAVG (per hour) | – | 2.0E-07 | 2.0E-06 | 2.0E-05 | 1.0E-04 | 1.0E-03 | 1.0E-02 | 4.0E-04
λ50 (per hour) | 1.4E-07 | 3.4E-07 | 2.1E-06 | 2.0E-05 | 1.7E-04 | 1.1E-03 | 1.0E-02 | 4.7E-04
λ70 (per hour) | 2.4E-07 | 4.9E-07 | 2.5E-06 | 2.1E-05 | 2.4E-04 | 1.2E-03 | 1.1E-02 | 5.9E-04
λ90 (per hour) | 4.6E-07 | 7.8E-07 | 3.1E-06 | 2.3E-05 | 3.9E-04 | 1.5E-03 | 1.1E-02 | 8.0E-04
λ90 / λ50 | 3.3 | 2.3 | 1.4 | 1.1 | | | |
λ90 / λ70 | 1.9 | 1.6 | 1.2 | 1.1 | 1.6 | 1.2 | 1.1 | 1.4
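The confidence bounds in the table come from the standard χ² estimator for a constant failure rate: λ at confidence C from n failures in T device-hours is χ²(C, 2(n+1)) / (2T). A minimal Python sketch of that calculation, using the Wilson–Hilferty approximation to the χ² quantile so no stats library is needed (the population figures below are the illustrative ones from the table, not field data):

```python
from statistics import NormalDist

def chi2_ppf(p: float, k: float) -> float:
    """Wilson-Hilferty approximation to the chi-squared quantile with k degrees of freedom."""
    z = NormalDist().inv_cdf(p)          # standard normal quantile
    c = 2.0 / (9.0 * k)
    return k * (1.0 - c + z * c ** 0.5) ** 3

def lambda_at_confidence(conf: float, failures: int, device_hours: float) -> float:
    """Failure rate bound at a given confidence: chi2(conf, 2(n+1)) / (2T)."""
    return chi2_ppf(conf, 2 * (failures + 1)) / (2.0 * device_hours)

# 1 failure observed in 5,000,000 device-hours (100 devices for 50,000 h)
l50 = lambda_at_confidence(0.50, 1, 5_000_000)   # about 3.4E-07 per hour
l90 = lambda_at_confidence(0.90, 1, 5_000_000)   # about 7.8E-07 per hour
```

With only 1 failure, λ90 is about 2.3 times λ50; with 100 failures in the same population the ratio falls to about 1.1, reproducing the table's message that once enough failures have been recorded, λ90 ≈ λ70 ≈ λAVG.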
[Supporting worksheet: failure log for 1,000 devices over 2,098 days (50,352,000 device-hours) with 20 failures. MTBF = 2,517,600 hours; λ50 ≈ 4.0E-07 per hour; λ90 (90% confidence, χ²) ≈ 5.4E-07 per hour. Deriving λ90 from the mean and standard deviation of the times between failures instead gives ≈ 9.9E-07 per hour.]
Worked example: 100 devices in service from 1/1/10 to 9/30/15 (2,098 days; 50,352 hours), with failures on 9/30/11, 6/3/13 and 11/4/14.

Time in service | 5,035,200 | device-hours
Number of failures | 3
MTBF | 1,678,400 | hours
λ50 | 6.0E-07 | per hour
λ90 (90% confidence) | 1.3E-06 | per hour
Check of mean failure rate using χ²: λ50 ≈ 7.3E-07 per hour
Can we be confident?
Users have collected a great deal of data, over many years, in a variety of environments.
With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years.
Surely by now we must have recorded enough failures for most commonly applied devices?
λ90 ≈ λ70 ≈ λAVG
Where can we find credible data?
An enormous effort has been made to determine λ. Some widely used sources:
• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into the exSILentia software and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data
Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.
Wide variation in reported λ
Why is the variation so wide, if λ is constant?
OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence the failure rate
λ is NOT constant, so χ² confidence levels do not apply. OREDA instead provides lower and upper deciles to show the uncertainty.
A long-standing debate
Should the calculated probability of failure take into account systematic failures?
The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.
ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.
The intent of the standards
Manage the risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods
Reduce the risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
‒ Apply diagnostics for fault detection and regular periodic testing
‒ Apply redundant equipment for fault tolerance
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match
If I record 47 heads in 100 tosses, P(head) ≈ 0.47
If a football team wins 47 matches out of 100, what is the probability they will win their next match?
What is random?
Dictionary: made, done, or happening without method or conscious decision; haphazard.
Mathematics: a purely random process involves mutually independent events; the probability of any one event is not dependent on other events.
These definitions are significantly different.
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.
Guidance from ISO/TR 12489
Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age and wear related failures in the mid-life period)
• Human – operating under stress, non-routine: variable λ
Systematic – cannot be quantified by a fixed rate:
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9, catalectic failure: sudden and complete failure.
Note 1 to entry: [...] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5, random hardware failure: failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
Not limited to 'catalectic' failure
Such failures cannot be characterised by a fixed constant failure rate, though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• so they are treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but vary widely between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed:

PFD(t) = ∫₀ᵗ λDU e^(−λDU τ) dτ = 1 − e^(−λDU t)
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU t

With a proof test interval T, the average is simply:

PFDavg ≈ λDU T / 2
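How good is the linear approximation? A short sketch comparing it with the exact average of PFD(t) over one proof-test interval; the values λDU = 0.02 per annum and T = 1 year are illustrative assumptions, not field data:

```python
import math

lam_du = 0.02 / 8760.0   # illustrative: 0.02 failures per annum, expressed per hour
T = 8760.0               # proof test interval: one year, in hours

# Exact average of PFD(t) = 1 - exp(-lam*t) over [0, T]
pfd_avg_exact = 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)

# Linear approximation, valid while lam_du * T << 1
pfd_avg_approx = lam_du * T / 2.0
```

For λDU T = 0.02 the two values agree to better than 1%, which is far tighter than the order-of-magnitude accuracy of the failure rate itself.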
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of the final elements. With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β) (λDU T)² / 3 + β λDU T / 2

β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates the PFD, through the β λDU T / 2 term. The PFD depends equally on λDU, β and T.
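The relative size of the two terms is easy to check numerically. A minimal sketch, assuming the representative values β = 0.1 and λDU T = 0.02 (illustrative, not from the source data):

```python
def pfd_avg_1oo2(lam_du: float, beta: float, T: float) -> float:
    """PFDavg ~ (1 - beta)(lam*T)^2/3 + beta*lam*T/2 for a 1oo2 pair."""
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0    # both fail independently
    common_cause = beta * lam_du * T / 2.0                   # shared-cause failure
    return independent + common_cause

# Illustrative values: lam_du*T = 0.02 (e.g. 0.02 per annum, 1-year test interval)
lam_du, beta, T = 0.02, 0.10, 1.0
independent = (1 - beta) * (lam_du * T) ** 2 / 3    # ~1.2E-04
common_cause = beta * lam_du * T / 2                # ~1.0E-03
```

Here the common cause term is roughly eight times the independent term, which is why reducing β matters as much as reducing λDU or T.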
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain, and failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude, so the scope for reduction is clear: a reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but they are sufficient to show feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).
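The MTBF figures above can be turned into rough single-element (1oo1) PFDavg values using PFDavg ≈ λDU·T/2 at T = 1 year; a stdlib-only sketch (the device names and dictionary layout are illustrative):

```python
# Order-of-magnitude PFDavg for single (1oo1) elements at a 1-year proof
# test interval, using PFDavg ≈ lambda_DU * T / 2 and the MTBF figures above.

T = 1.0  # proof test interval, years

lambda_du = {                  # undetected dangerous failure rate, per annum
    "sensor": 1 / 300,         # MTBF ≈ 300 y
    "valve assembly": 1 / 50,  # MTBF ≈ 50 y
    "contactor": 1 / 200,      # MTBF ≈ 200 y
}

pfd_avg = {name: rate * T / 2 for name, rate in lambda_du.items()}
# valve assembly ≈ 0.01, well above the sensor ≈ 0.0017 --
# which is why final elements usually dominate SIF PFD
```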
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12% to 15% in practice (Ref: SINTEF Report A26922). β is:
• difficult to reduce below 5% with similar devices
• strongly dominant in the PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• prevention of systematic failures
Typical values for T depend on application:
Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values (β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y):
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements unless overall λDU can be reduced to < 0.02 pa (MTBF_DU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate actual λDU and compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design
Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant That assumption is invalid but the calculations are still useful
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates It draws conclusions on how failure rates and failure probability can be controlled in practice
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications The variation depends largely on the service operating environment and maintenance practices It is clear that
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best the risk reduction achieved by these safety functions can be estimated only within an order of magnitude Nevertheless even with such imprecise results the calculations are still useful
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions This is important in setting operational reliability benchmarks
Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices
Most failures in safety function components (including software) are predictable preventable or avoidable to some degree suggesting that many failures are mostly systematic in nature rather than
purely random Therefore safety function reliability performance can be improved through four key strategies
1 Eliminating systematic and common-cause failures throughout the design development and implementation processes and throughout operation and maintenance practices
2 Designing the equipment to allow access to enable sufficiently frequent inspection testing and maintenance and to enable suitable test coverage
3 Using risk-based inspection and condition-based maintenance techniques to
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4 Detailed analysis of all failures to
- Monitor failure rates
- Determine how failure recurrence may be prevented if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction Functional safety maintains safety integrity of assets in two ways
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry This term has been introduced by analogy with the catalectic verses (ie a verse with seven foots instead of eight) which stop abruptly Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item It is the contrary of failures occurring progressively and incompletely
Note 2 to entry Catalectic failures characterize simple components with constant failure rates (exponential law) they remain permanently ldquoas good as newrdquo until they fail suddenly completely and without warning Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (eg Markovian approach)
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components then undetected failures will accumulate exponentially The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability For this reason the standards specify a minimum level of fault tolerance (ie redundancy) in the architectural design of the safety functions The required level of redundancy increases with the risk reduction required
Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions Redundancy also increases the likelihood of inadvertent or spurious action which in itself may lead to increased risk of hazards
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the confidence is further increased: λ90 will be no more than about 20% higher than λ70.
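This dependence on the number of recorded failures can be reproduced with a stdlib-only sketch. It assumes the common time-truncated estimator λa = χ²a(2n + 2)/(2·t_op) and approximates the χ² quantile with the Wilson-Hilferty transform; both choices are illustrative assumptions, not prescribed by the text:

```python
from statistics import NormalDist

# Failure rate at confidence level a, from n observed failures in total
# operating time t_op (device-hours). Uses the common time-truncated form
# lambda_a = chi2_a(2n + 2) / (2 * t_op), with the chi-squared quantile
# approximated by the Wilson-Hilferty transform (stdlib only, no scipy).

def chi2_quantile(p: float, df: int) -> float:
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * (c ** 0.5)) ** 3

def lambda_at_confidence(a: float, n_failures: int, t_op: float) -> float:
    return chi2_quantile(a, 2 * n_failures + 2) / (2.0 * t_op)

# The lambda90 / lambda70 ratio depends only on the number of failures:
ratio_10 = lambda_at_confidence(0.90, 10, 1e6) / lambda_at_confidence(0.70, 10, 1e6)
ratio_3 = lambda_at_confidence(0.90, 3, 1e6) / lambda_at_confidence(0.70, 3, 1e6)
# ratio_10 ≈ 1.24 (about the '20%' quoted above); ratio_3 ≈ 1.4
```

Note that t_op cancels in the ratios, illustrating the claim that the ratio depends only on the number of recorded failures.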
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment In general the data exclude infant mortality failures and end-of-life failures
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software The exida database is also published in the exida Safety Equipment Reliability Handbook[12] The failure rates in these references are reasonably consistent with the OREDA data though they do not give any indication of the wide uncertainty band
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications Some users consistently achieve failure rates at least 10 times lower than other users The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design operation and maintenance
One reason for the variability in rates is that these datasets include all failures systematic failures as well as random failures
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'
The key difference is the requirement for mutual independence Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random
By contrast the standards note that systematic failures cannot be characterised by a fixed rate though to some extent the probability of systematic faults existing in a system may be estimated The standards are reasonably consistent in their definition of systematic failure
'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random Very few failures are purely random
Failures of mechanical components may seem to be random but the causes are usually well known and understood and are partially deterministic The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed
Degradation of mechanical components is not a purely random process Degradation can be monitored
While it might be theoretically possible to prevent failure from degradation it is simply not practicable to prevent all failures Inspection and maintenance can never be perfect It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance overhaul and renewal They are characterised by a constant failure rate even though that rate is not fixed
The reasons for the wide variability in reported failure rates seem to be clear
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events These failures are largely systematic and preventable
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors With fault tolerance in final elements the PFD is strongly dominated by common cause failures
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
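The break-even value of β below which the common cause term no longer dominates can be computed directly by equating the two terms of the expression above (a sketch with the example values; the function name is illustrative):

```python
# Break-even beta: the value at which the common-cause term beta*lam*T/2
# equals the independent term (1 - beta)*(lam*T)**2 / 3. Above this beta
# the common-cause term dominates PFDavg.

def break_even_beta(lam_du: float, t: float) -> float:
    x = lam_du * t
    return (x ** 2 / 3) / (x / 2 + x ** 2 / 3)

beta_star = break_even_beta(0.02, 1.0)   # ≈ 0.013, i.e. about 1.3%
# so with T = 1 y and lambda_DU = 0.02 pa the common-cause term dominates
# for any beta above roughly 1-2%, consistent with the text above
```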
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
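The detected-failure contribution is a one-line calculation once the units are made consistent; the values below are illustrative assumptions, not figures from the paper:

```python
# Contribution of detected failures to PFD: the function is unavailable for
# the repair time, so PFD_detected ≈ lambda_DD * MTTR (in consistent units).
# Illustrative values only.

HOURS_PER_YEAR = 8760.0

def pfd_detected(lam_dd_pa: float, mttr_hours: float) -> float:
    # lambda_DD is per annum; convert MTTR from hours to years
    return lam_dd_pa * (mttr_hours / HOURS_PER_YEAR)

# lambda_DD = 0.1 pa with MTTR = 72 h contributes about 8e-4 to PFD,
# comparable to the common-cause term in the 1oo2 valve example
contribution = pfd_detected(0.1, 72.0)
```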
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George EP Box
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant and the band of uncertainty spans more than one order of magnitude The statistics suggest that the failure rates of sensors are also not constant
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision because the failure rates are not constant It is misleading to report calculated PFD with more than one significant figure of precision
Application of Markov models Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision The results are much more precise but no more accurate
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible
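Categorising an order-of-magnitude PFD estimate into a SIL band can be expressed as a small helper; the band boundaries are the standard low-demand ranges, and the function name is illustrative:

```python
import math

# SIL bands for low-demand safety functions:
#   SIL n  <=>  10^-(n+1) <= PFDavg < 10^-n   (n = 1..4)

def sil_from_pfd(pfd_avg: float) -> int:
    """SIL band for a given PFDavg; 0 means less than SIL 1 risk reduction."""
    if pfd_avg >= 1e-1:
        return 0
    if pfd_avg < 1e-5:
        return 4
    return -math.floor(math.log10(pfd_avg)) - 1

# e.g. the 1oo2 valve example (PFDavg ≈ 1.1e-3, RRF ≈ 900) lands in SIL 2
```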
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (ie 1oo1 architecture) using either a valve or contactor
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture but the PFD may be marginal particularly if a valve is used as the final element If actuated valves are used attention will need to be given to minimising λDU Alternatively the PFD may be reduced by reducing the interval
between proof tests T If these parameters cannot be minimised it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection testing and maintenance (reducing λDU and improving test coverage) If reducing β and λDU is not sufficient it may also be necessary to shorten the test interval T It is usually impractical to implement automatic continuous diagnostics on final elements
The end result of the PFD calculations is a set of operational performance targets for β λDU T and MTTR These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design Evidence is also needed to show that operation inspection testing and maintenance of the systems will be effective in achieving the target failure performance
Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
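A follow-up check in the spirit of this guidance can be sketched as a comparison between observed failures and the count expected from the design-assumed rate. The 95% Poisson criterion and all names here are illustrative assumptions, not taken from the SINTEF guideline:

```python
import math

# Compare observed failures against the count expected from the failure rate
# assumed in design, flagging when the observation is improbably high.
# Threshold: smallest k with Poisson CDF(k; expected) >= 0.95.

def poisson_cdf(k: int, mu: float) -> float:
    return math.exp(-mu) * sum(mu ** i / math.factorial(i) for i in range(k + 1))

def alert_threshold(lam_assumed: float, n_devices: int, years: float) -> int:
    mu = lam_assumed * n_devices * years   # expected failures in the period
    k = 0
    while poisson_cdf(k, mu) < 0.95:
        k += 1
    return k

# 50 valves, lambda_DU assumed 0.02 pa, over 2 years: expect 2 failures;
# observing more than the threshold suggests the design assumption is exceeded
threshold = alert_threshold(0.02, 50, 2.0)
```

An exceeded threshold would then trigger the root cause analysis and compensating measures described above.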
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment If the hardware failure rates are relatively low and the population of devices is small there may be too few failures to allow a rate to be measured
Detailed inspection testing and condition monitoring can then be used to provide leading indicators of incipient failure Most of the modes of failure should be predictable It should be feasible to prevent failures through condition based maintenance with overhaul renewal or replacement when required
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that the plants are designed to minimise initial construction cost The equipment is not designed to facilitate accessibility for testing or for maintenance
For example on LNG compression trains access to the equipment is often constrained Some critical final elements can only be taken out of service at intervals of 5 years or more Even if the designers have provided facilities to enable on-line condition monitoring and testing the opportunities for corrective maintenance are severely constrained by the need to maintain production Deteriorating equipment must remain in service until the next planned shutdown
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If dutystandby pairing is not provided for critical final elements then accessibility for on-line inspection testing and maintenance must be considered in the design A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration MTTR)
Designing the system to facilitate inspection testing and maintenance enables both the λDU and the MTTR to be minimised in operation
Preventing systematic failures
Systematic failures cannot be characterised by failure rate The probability of systematic faults existing within a system cannot be quantified with precision But by definition systematic failures can be eliminated after being detected The implication of this is that systematic failures can be prevented
Due diligence must be demonstrated in preventing systematic failures as far as is practicable in proportion to the target level of risk reduction Plant owners need to satisfy themselves that
appropriate processes techniques methods and measures have been applied with sufficient effectiveness to eliminate systematic failures The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures and must measure and monitor the effectiveness of those steps
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to the management of the 'random' failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude: the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated, and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail, and the effectiveness of the processes, techniques and measures, must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Where can we find credible data?
An enormous effort has been made to determine λ. Some widely used sources:
• OREDA 'Offshore and Onshore Reliability Data Handbook'
• SINTEF PDS Data Handbook, 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into the exSILentia software and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data
Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span one or two orders of magnitude.
Wide variation in reported λ
Why is the variation so wide if λ is constant? OREDA includes all mid-life failures:
• only some are random
• some are systematic (depending on the application)
• site-specific factors influence the failure rate
λ is NOT constant, so χ² confidence levels do not apply. OREDA provides lower and upper deciles to show the uncertainty.
A long-standing debate
Should the calculated probability of failure take into account systematic failures?
The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.
ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.
The intent of the standards
Manage the risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods
Reduce the risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
– Apply diagnostics for fault detection and regular periodic testing
– Apply redundant equipment for fault tolerance
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match
If I record 47 heads in 100 tosses, P(head) ≈ 0.47.
If a football team wins 47 matches out of 100, what is the probability they will win their next match?
What is random?
Dictionary: 'Made, done, or happening without method or conscious decision; haphazard'
Mathematics: A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.
These definitions are significantly different.
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.
Guidance from ISO/TR 12489
Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age- and wear-related failures in the mid-life period)
• Human – operating under stress, non-routine: variable λ
Systematic – cannot be quantified by a fixed rate:
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9, catalectic failure: sudden and complete failure.
Note 1 to entry: [...] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5, random hardware failure:
'failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
This definition is not limited to 'catalectic' failure. Such failures cannot be characterised by a fixed, constant failure rate, though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age- or wear-related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but with wide variation between different operators
• not a fixed, constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed:

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t) ≈ λDU·t
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially. SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU·t

With a proof test interval T, the average is simply:

PFDavg ≈ λDU·T/2
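The linear approximation can be checked numerically; a minimal sketch (Python, function names ours, using a typical valve-assembly failure rate):

```python
import math

def pfd_instantaneous(lam_du, t):
    """PFD(t) = 1 - exp(-λDU·t): probability the device has already failed by t."""
    return 1.0 - math.exp(-lam_du * t)

def pfd_avg_exact(lam_du, T):
    """Exact time-average of PFD(t) over one proof-test interval [0, T]."""
    return 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)

def pfd_avg_approx(lam_du, T):
    """Linearised approximation PFDavg ≈ λDU·T/2, valid while λDU·T < 0.1."""
    return lam_du * T / 2.0

# Typical actuated valve assembly: λDU ≈ 0.02 per annum, proof-tested yearly.
lam_du, T = 0.02, 1.0
```

For λDU·T = 0.02 the exact and linearised averages agree to better than 1%, far inside the order-of-magnitude precision that is achievable anyway.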
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of the final elements. With 1oo2 valves the PFD is approximately:

PFDavg ≈ ((1 − β)·λDU·T)²/3 + β·λDU·T/2

where β is the fraction of failures that share a common cause. Common cause failure always strongly dominates the PFD. The PFD depends equally on λDU, β and T.
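A sketch of the same formula in code (Python, names ours; the values are the typical ones quoted later) makes the dominance of the β term visible:

```python
def pfd_1oo2(lam_du, beta, T):
    """Approximate PFDavg of a 1oo2 pair with common-cause fraction beta."""
    independent = ((1.0 - beta) * lam_du * T) ** 2 / 3.0  # both fail independently
    common_cause = beta * lam_du * T / 2.0                # one cause fails both
    return independent + common_cause

# With λDU ≈ 0.02 /a, β ≈ 10 % and T = 1 a, the independent term is
# about 1.1e-4 while the common-cause term is about 1.0e-3.
```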
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed, constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude, so the scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years, λDU ≈ 0.003 per annum.
Actuated valve assemblies typically have MTBF ≈ 50 years, λDU ≈ 0.02 per annum.
Contactors or relays typically have MTBF ≈ 200 years, λDU ≈ 0.005 per annum.
These are order-of-magnitude estimates, but they are sufficient to show the feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%. Typically β ≈ 12%–15% in practice (Ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates the PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on the application
Batch process: T < 0.1 years
– results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
– low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless the overall λDU can be reduced to < 0.02 pa (MTBF for dangerous undetected failures > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
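These architecture claims can be spot-checked with a short calculation. A sketch (helper names ours; the SIL bands are the standard demand-mode PFDavg bands, SIL n meaning 10^−(n+1) ≤ PFDavg < 10^−n):

```python
import math

def pfd_1oo1(lam_du, T):
    """Single final element, approximate PFDavg."""
    return lam_du * T / 2.0

def pfd_1oo2(lam_du, beta, T):
    """1oo2 pair with common-cause fraction beta, approximate PFDavg."""
    return ((1.0 - beta) * lam_du * T) ** 2 / 3.0 + beta * lam_du * T / 2.0

def sil_band(pfd_avg):
    """Demand-mode SIL band for a given PFDavg."""
    return math.ceil(-math.log10(pfd_avg)) - 1

# Typical final element: λDU ≈ 0.02 /a, β ≈ 10 %, T = 1 a.
# 1oo1 gives PFDavg ≈ 0.01 (the very bottom of SIL 1);
# 1oo2 gives PFDavg ≈ 0.0011, i.e. SIL 2 but not SIL 3.
```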
Design assumptions set performance benchmarks
The probability of SIF failure on demand is proportional to β, λDU, T and MTTR. The values assumed in design set performance standards for availability and reliability in operation and maintenance. Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate the actual λDU and compare it with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero risk reduction while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
– preventing systematic failures
• Volume of operating experience provides evidence that:
– systematic capability is adequate
– inspection and maintenance is effective
– failures are monitored, analysed and understood
– failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data
Credible reliability data ≡ feasible performance targets. Performance targets must be feasible by design.
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and on failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, the operating environment and the maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed, constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
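This accumulation can be illustrated with a small simulation (a sketch, not from the paper; names and values are ours): draw exponentially distributed failure times for a population and compare the fraction failed by time t with 1 − e^(−λDU·t).

```python
import math
import random

def fraction_failed(lam_du, t, n=100_000, seed=1):
    """Fraction of a simulated population whose (exponentially distributed)
    undetected failure time falls before time t."""
    rng = random.Random(seed)
    return sum(rng.expovariate(lam_du) <= t for _ in range(n)) / n

lam_du, t = 0.02, 5.0                   # e.g. 5 years without a proof test
expected = 1.0 - math.exp(-lam_du * t)  # analytic accumulation, ≈ 0.095
```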
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFDs of the individual components.
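For the small PFD values involved, the series combination reduces to a simple sum to first order. A sketch with illustrative subsystem values (assumed, not from the paper):

```python
def system_pfd(subsystem_pfds):
    """First-order series combination: PFDs add while each PFD << 1."""
    return sum(subsystem_pfds)

# Illustrative subsystem values (assumptions for the example only):
sensor_pfd = 0.0015   # e.g. 2oo3 transmitters
logic_pfd = 0.0001    # e.g. certified logic solver
valve_pfd = 0.0011    # e.g. 1oo2 valves including common cause

pfd_avg = system_pfd([sensor_pfd, logic_pfd, valve_pfd])
rrf = 1.0 / pfd_avg   # risk reduction factor, here roughly a few hundred
```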
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to an increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself, or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
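A sketch of this estimator in code (the χ² quantile uses the Wilson–Hilferty approximation to stay dependency-free; in practice scipy.stats.chi2.ppf would be used, and the 2(n+1) degrees of freedom follow the usual time-terminated-test convention, cf. IEC 60605-6):

```python
Z = {0.70: 0.5244, 0.90: 1.2816}   # standard normal quantiles z_a

def chi2_quantile(a, k):
    """Wilson-Hilferty approximation to the χ² quantile with k degrees of freedom."""
    return k * (1.0 - 2.0 / (9.0 * k) + Z[a] * (2.0 / (9.0 * k)) ** 0.5) ** 3

def lambda_conf(n_failures, total_time, a):
    """Upper bound λa on the failure rate at confidence level a, given
    n_failures observed over total_time accumulated operating time."""
    return chi2_quantile(a, 2 * (n_failures + 1)) / (2.0 * total_time)

# The λ90/λ70 ratio depends only on the number of failures, not on the
# observation time or the rate itself:
ratio_10 = lambda_conf(10, 1000.0, 0.90) / lambda_conf(10, 1000.0, 0.70)  # ≈ 1.24
ratio_3 = lambda_conf(3, 1000.0, 0.90) / lambda_conf(3, 1000.0, 0.70)     # ≈ 1.40
```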
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal, steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed, constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process; degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 6
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
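The dominance of the common cause term can be checked numerically. The sketch below is illustrative only (the function names are ours); it evaluates both terms of the approximation for the example values given above.

```python
# Illustrative sketch of the two terms in the 1oo2 approximation
# PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2, using the example values from the text.

def pfd_terms(lam_du, T, beta):
    """Return (independent term, common cause term) of the 1oo2 PFD."""
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    return independent, common_cause

ind, ccf = pfd_terms(lam_du=0.02, T=1.0, beta=0.10)
print(ind, ccf)          # common cause term is roughly 8x the independent term here

# β at which the two terms are equal: β·λT/2 = (1 − β)(λT)²/3
lam_t = 0.02 * 1.0
beta_equal = 2 * lam_t / (3 + 2 * lam_t)
print(beta_equal)        # ≈ 0.013, i.e. β must be below about 1-2%
```

With a realistic β of around 10%, the common cause term overwhelms the independent term, which is the point made in the text.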
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
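The detected failure contribution is simply the product of those two factors. A minimal sketch, with assumed example values (not from the reference data):

```python
# Sketch: PFD contribution of detected dangerous failures,
# PFD_DD ≈ λDD × MTTR, with both in consistent units (hours here).
# The example numbers are assumptions for illustration only.

def pfd_detected(lam_dd_per_hour, mttr_hours):
    return lam_dd_per_hour * mttr_hours

# λDD = 1e-5 per hour with an 8 hour mean time to restoration:
print(pfd_detected(1e-5, 8.0))   # 8e-05
```

This shows why a short MTTR matters: doubling the restoration time doubles the detected-failure contribution to the PFD.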
Wrong but useful
The SINTEF Report A26922 includes a pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful.'
The quote is in the context of a discussion of different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
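Since the estimate is only good to an order of magnitude, the useful output is the SIL band rather than a precise number. A minimal sketch (the helper function is ours, using the demand-mode PFD bands from IEC 61511):

```python
# Sketch: map an average PFD to its demand-mode SIL band.
# Band boundaries per IEC 61511: SIL n covers 10^-(n+1) <= PFDavg < 10^-n.

def sil_band(pfd_avg):
    bands = [(1e-2, 1e-1, 1), (1e-3, 1e-2, 2), (1e-4, 1e-3, 3), (1e-5, 1e-4, 4)]
    for lo, hi, sil in bands:
        if lo <= pfd_avg < hi:
            return sil
    return None  # outside the SIL 1-4 demand-mode range

print(sil_band(0.004))   # 2 -- reporting more digits than the band is misleading
```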
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
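That benchmark comparison can be sketched as follows (the function names and the example population are ours, for illustration only):

```python
# Sketch: compare observed failures against the count expected from the
# failure rate assumed in design (λDU × number of devices × years observed).

def expected_failures(lam_per_year, n_devices, years):
    return lam_per_year * n_devices * years

def exceeds_benchmark(observed, lam_per_year, n_devices, years):
    return observed > expected_failures(lam_per_year, n_devices, years)

# 50 valves at the design rate λDU = 0.02 pa, observed over 4 years:
print(expected_failures(0.02, 50, 4))     # ≈ 4 failures expected
print(exceeds_benchmark(7, 0.02, 50, 4))  # True: analyse causes, consider compensating measures
```

With small populations the expected count is itself only a handful of failures, which is why the report recommends analysing causes rather than relying on the raw rate alone.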
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates for some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate access for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated, and if inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail, and the effectiveness of the processes, techniques and measures, must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary, both to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.
Wide variation in reported λ
Why is the variation so wide if λ is constant?
OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence failure rate
λ is NOT constant, so χ² confidence levels do not apply. OREDA provides lower and upper deciles to show the uncertainty.
A long-standing debate
Should the calculated probability of failure take into account systematic failures?
The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failure.
ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.
The intent of the standards
Manage the risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods
Reduce the risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
‒ Apply diagnostics for fault detection and regular periodic testing
‒ Apply redundant equipment for fault tolerance
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match
If I record 47 heads in 100 tosses, P(head) ≈ 0.47
If a football team wins 47 matches out of 100, what is the probability they will win their next match?
What is random?
Dictionary: 'Made, done, or happening without method or conscious decision; haphazard.'
Mathematics: A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.
These definitions are significantly different.
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.
Guidance from ISO/TR 12489
Random:
‒ Hardware, electronic components: constant λ
‒ Hardware, mechanical components: non-constant λ (age and wear related failures in the mid-life period)
‒ Human, operating under stress, non-routine: variable λ
Systematic (cannot be quantified by a fixed rate):
‒ Hardware: specification, design, installation, operation
‒ Software: specification, coding, testing, modification
‒ Human: depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9, catalectic failure: sudden and complete failure.
Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511-1 §3.2.59 and IEC 61508-4 §3.6.5, random hardware failure: failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
Not limited to 'catalectic' failure
Cannot be characterised by a fixed constant failure rate, though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially:
e.g. a 10% chance that a device picked at random has failed
PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t)
(plotted against λt)
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:
PFD(t) ≈ λDU·t
With a proof test interval T, the average is simply PFDavg ≈ λDU·T/2
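The linear approximation can be checked against the exponential form; a quick numeric sketch using the example rate quoted in these slides:

```python
import math

# Compare the time-average of the exact accumulation 1 − e^(−λDU·t) over a
# proof test interval T with the linear result λDU·T/2.
lam_du = 0.02   # per annum, example value from the slides
T = 1.0         # proof test interval, years

exact_avg = 1 - (1 - math.exp(-lam_du * T)) / (lam_du * T)
linear_avg = lam_du * T / 2

print(exact_avg, linear_avg)   # agree to well within 1% at this λDU·T
```

For small λDU·T the two agree closely, which justifies using the simple λDU·T/2 form whenever PFDavg < 0.1.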
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of the final elements. With 1oo2 valves the PFD is approximately:
PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2
β is the fraction of failures that share a common cause.
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of the final elements. With 1oo2 valves the PFD is approximately:
PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2
β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates the PFD. The PFD depends equally on λDU, β and T.
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but they are sufficient to show the feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).
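These MTBF figures convert directly to rates; a small sketch of the arithmetic (values from the slide, conversion assuming 8760 hours per year):

```python
# Convert the quoted order-of-magnitude MTBF values to failure rates.
HOURS_PER_YEAR = 8760

def rate_per_year(mtbf_years):
    return 1.0 / mtbf_years

def rate_per_1e6_hours(mtbf_years):
    return 1e6 / (mtbf_years * HOURS_PER_YEAR)

print(rate_per_year(300))       # ≈ 0.003 pa, sensor
print(rate_per_year(50))        # ≈ 0.02 pa, actuated valve assembly
print(rate_per_1e6_hours(50))   # ≈ 2.3 failures per 10^6 hours
```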
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12% – 15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates the PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on the application
Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ Low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements unless the overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
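The feasibility claims above can be reproduced with the simplified formulas from the earlier slides (typical values assumed, as stated on the slide):

```python
# Sketch: PFD of 1oo1 and 1oo2 final element subsystems using the
# simplified formulas and the typical values quoted here.

def pfd_1oo1(lam_du, T):
    return lam_du * T / 2

def pfd_1oo2(lam_du, T, beta):
    return (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2

lam_du, T, beta = 0.02, 1.0, 0.10
print(pfd_1oo1(lam_du, T))        # 0.01: only marginal for SIL 2
print(pfd_1oo2(lam_du, T, beta))  # ≈ 0.0011: within SIL 2, but not SIL 3
```

With these typical values even the 1oo2 architecture falls short of the SIL 3 band (PFD < 1e-3), which is why λDU, β and/or T must be reduced.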
Design assumptions set performance benchmarks
The probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance. Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and failure modes:
• Calculate the actual λDU and compare it with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design
Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring increased confidence levels, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices
2. Designing the equipment to allow access to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways.
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
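Under the constant-rate assumption the PFD of a single component follows directly from the exponential distribution. A minimal sketch of the 1oo1 case (illustrative only; the function names are ours, and the example rate of 0.02 pa is the typical valve figure used later in this paper):

```python
import math

def pfd_instantaneous(lam_du: float, t: float) -> float:
    """Probability that a component has failed dangerous-undetected
    by time t, under a constant failure rate lam_du."""
    return 1.0 - math.exp(-lam_du * t)

def pfd_avg_1oo1(lam_du: float, test_interval: float) -> float:
    """Time-average PFD over a proof-test interval T (exact form;
    reduces to lam_du * T / 2 when lam_du * T << 1)."""
    x = lam_du * test_interval
    return 1.0 - (1.0 - math.exp(-x)) / x

lam_du = 0.02   # dangerous undetected failures per year (typical valve)
T = 1.0         # proof-test interval in years

print(round(pfd_avg_1oo1(lam_du, T), 4))   # ≈ 0.0099, close to lam_du*T/2 = 0.01
```

The simple λDU·T/2 approximation is adequate here precisely because λDU·T ≪ 1; the order-of-magnitude caveats discussed later apply regardless of which form is used.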
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance
• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
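For time-truncated data the upper bound at confidence a is commonly taken as λa = χ²a(2n + 2) / (2t), where n is the number of failures and t the cumulative operating time. A rough sketch (the function names and fleet hours are hypothetical; the Wilson-Hilferty approximation stands in for an exact χ² quantile such as scipy.stats.chi2.ppf):

```python
from statistics import NormalDist

def chi2_quantile(p: float, k: int) -> float:
    """Wilson-Hilferty approximation to the chi-squared quantile
    with k degrees of freedom (adequate for this illustration)."""
    z = NormalDist().inv_cdf(p)
    return k * (1.0 - 2.0 / (9.0 * k) + z * (2.0 / (9.0 * k)) ** 0.5) ** 3

def failure_rate_upper(n_failures: int, cumulative_hours: float,
                       confidence: float) -> float:
    """Upper confidence bound on the failure rate from n observed
    failures over the fleet's cumulative operating hours
    (time-truncated data: 2n + 2 degrees of freedom)."""
    return chi2_quantile(confidence, 2 * n_failures + 2) / (2.0 * cumulative_hours)

hours = 4.0e6                       # hypothetical fleet exposure
l70 = failure_rate_upper(10, hours, 0.70)
l90 = failure_rate_upper(10, hours, 0.90)
print(round(l90 / l70, 2))          # ≈ 1.24 with 10 failures recorded
```

The ratio depends only on n, as the paper notes: the cumulative hours cancel, and with 10 or more failures λ90 sits roughly 20-25% above λ70.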
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors. With fault tolerance in final elements the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
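The dominance of the common cause term is easy to check numerically. A small illustrative calculation using the typical values quoted in this paper (β = 10%, λDU = 0.02 pa, T = 1 y):

```python
beta = 0.10      # common cause fraction (SINTEF suggests >10% in practice)
lam_du = 0.02    # dangerous undetected failure rate, per year
T = 1.0          # proof-test interval, years

# Two terms of the 1oo2 approximation PFDavg ≈ (1-β)(λDU·T)²/3 + β·λDU·T/2
independent = (1 - beta) * (lam_du * T) ** 2 / 3
common_cause = beta * lam_du * T / 2

pfd = independent + common_cause
print(round(common_cause / independent, 1))   # ≈ 8.3: CCF term dominates
print(round(1 / pfd))                         # RRF ≈ 893
```

The result also anticipates a later point in the paper: a 1oo2 valve pair with these typical parameters falls just short of the RRF of 1000 needed for SIL 3, so β and/or λDU must be reduced by design.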
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice.
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
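Since only the order of magnitude is meaningful, the useful output of a PFD calculation is the SIL band it falls in. A trivial sketch using the IEC 61511 demand-mode bands (the function name is ours):

```python
def sil_band(pfd_avg: float) -> int:
    """SIL category from average PFD, demand mode:
    SIL 1: 1e-2 <= PFD < 1e-1, SIL 2: 1e-3 <= PFD < 1e-2,
    SIL 3: 1e-4 <= PFD < 1e-3, SIL 4: 1e-5 <= PFD < 1e-4."""
    if 1e-2 <= pfd_avg < 1e-1:
        return 1
    if 1e-3 <= pfd_avg < 1e-2:
        return 2
    if 1e-4 <= pfd_avg < 1e-3:
        return 3
    if 1e-5 <= pfd_avg < 1e-4:
        return 4
    raise ValueError("outside the SIL 1-4 demand-mode bands")

print(sil_band(1.12e-3))   # the 1oo2 valve example above: SIL 2, not SIL 3
```

Reporting the band rather than a many-digit PFD avoids the false precision the paper warns against.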
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture) using either a valve or contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
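One simple way to set such a target is to treat the expected failure count as a Poisson mean and ask how surprising the observed count is. A hedged sketch (the fleet size, service period and counts are hypothetical, not taken from the SINTEF report):

```python
import math

def poisson_tail(observed: int, expected: float) -> float:
    """P(N >= observed) for a Poisson count with the given mean --
    how surprising the observed failure count is if the design
    failure rate assumption were correct."""
    cdf = sum(math.exp(-expected) * expected ** k / math.factorial(k)
              for k in range(observed))
    return 1.0 - cdf

# Hypothetical benchmark: 50 valves, assumed lam_du = 0.02 pa, 2 years in service
expected = 0.02 * 50 * 2           # 2 DU failures expected
p = poisson_tail(5, expected)      # but proof tests found 5
print(round(p, 3))                 # ≈ 0.053: evidence the assumed rate is optimistic
```

A small tail probability flags that the design assumption is being exceeded and that root cause analysis and compensating measures are due, which mirrors the report's guidance.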
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
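A leading indicator can be as simple as a trend line through successive test results. A hypothetical sketch for valve stroke times (the data and the alarm threshold are illustrative only; real limits are plant-specific):

```python
def stroke_time_trend(times_s: list) -> float:
    """Least-squares slope (seconds per test) through successive valve
    stroke times -- a rising trend is a leading indicator of
    actuator or valve degradation."""
    n = len(times_s)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(times_s) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, times_s))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical partial-stroke test history (seconds to reach 20% travel)
history = [2.1, 2.2, 2.1, 2.4, 2.6, 2.9]
slope = stroke_time_trend(history)
print(round(slope, 2))              # ≈ 0.16 s per test: trending upward
if slope > 0.1:                     # illustrative alarm threshold
    print("schedule overhaul before the next proof test")
```

Acting on the trend, rather than waiting for a failed proof test, is the condition-based maintenance approach the paper advocates.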
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing lsquorandomrsquo failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can be feasibly achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3 Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions Common cause failures dominate the PFD in all voted architectures of sensors and of final elements
4 In the operations phase measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design inspection testing maintenance and renewal The measured failure rates should be compared with the rates assumed in the PFD calculations
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future
If the measured failure rates are higher than the target benchmark then the reasons need to be understood and remedial action taken Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Wide variation in reported λ
Why is the variation so wide if λ is constant?
OREDA includes all mid-life failures:
– only some are random
– some are systematic (depending on the application)
– site-specific factors influence failure rate
λ is NOT constant, so χ² confidence levels do not apply. OREDA provides lower and upper deciles to show the uncertainty.
A long-standing debate
Should the calculated probability of failure take into account systematic failures?
The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.
ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.
The intent of the standards
Manage risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods
Reduce risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
– Apply diagnostics for fault detection and regular periodic testing
– Apply redundant equipment for fault tolerance
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match
If I record 47 heads in 100 tosses, P(head) ≈ 0.47
If a football team wins 47 matches out of 100, what is the probability they will win their next match?
What is random?
Dictionary:
'Made, done, or happening without method or conscious decision; haphazard'
Mathematics:
A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.
These definitions are significantly different
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.
Guidance from ISO/TR 12489
Random:
– Hardware, electronic components: constant λ
– Hardware, mechanical components: non-constant λ (age and wear related failures in the mid-life period)
– Human, operating under stress, non-routine: variable λ
Systematic (cannot be quantified by a fixed rate):
– Hardware: specification, design, installation, operation
– Software: specification, coding, testing, modification
– Human: depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9: catalectic failure – sudden and complete failure.
Note 1 to entry: [...] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure – failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate, though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but with a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
e.g. a 10% chance that a device picked at random has failed
PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t) ≈ λDU·t (for small λDU·t)
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:
PFD ≈ λDU·t
With a proof test interval T, the average is simply PFDavg ≈ λDU·T/2
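The exponential accumulation and its linear approximation can be checked numerically. A minimal sketch, where λDU and T are assumed illustrative values only:

```python
import math

lam_du = 0.02   # assumed dangerous undetected failure rate, per annum
T = 1.0         # assumed proof-test interval, years

pfd_exact = 1 - math.exp(-lam_du * T)   # exponential accumulation
pfd_linear = lam_du * T                 # linear approximation, valid while lam_du*t << 1
pfd_avg = lam_du * T / 2                # time-average over the proof-test interval

print(round(pfd_exact, 5), pfd_linear, pfd_avg)
```

At λDU·T = 0.02 the exact and linear values agree to within about 1%, which is why the simple λDU·T/2 average is adequate in the PFDavg < 0.1 region.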
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of the final elements.
With 1oo2 valves the PFD is approximately:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
β is the fraction of failures that share a common cause.
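Plugging in typical values (β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y, all assumed for illustration) shows how strongly the common cause term dominates:

```python
beta = 0.10     # assumed common cause fraction
lam_du = 0.02   # assumed undetected dangerous failure rate, per annum
T = 1.0         # assumed proof-test interval, years

independent = (1 - beta) * (lam_du * T) ** 2 / 3   # independent-failure term
common_cause = beta * lam_du * T / 2               # common cause term
pfd_avg = independent + common_cause

print(independent, common_cause, pfd_avg)
```

With these values the common cause term is roughly eight times the independent term, so adding a second valve buys far less risk reduction than the squared term alone would suggest.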
Common cause failure always strongly dominates the PFD: the common cause term β·λDU·T/2 outweighs the independent term (1 − β)·(λDU·T)²/3. The PFD depends equally on λDU, β and T.
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude, so the scope for reduction in failures is clear: a reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but they are sufficient to show feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).
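Using the single-device average PFDavg ≈ λDU·T/2, these order-of-magnitude rates translate directly into rough RRF feasibility estimates. A sketch assuming T = 1 year:

```python
# Assumed typical single-device failure rates from the text
typical_lam_du = {
    "sensor": 0.003,            # MTBF ~ 300 y
    "actuated valve": 0.02,     # MTBF ~ 50 y
    "contactor/relay": 0.005,   # MTBF ~ 200 y
}
T = 1.0  # assumed proof-test interval, years

for device, lam_du in typical_lam_du.items():
    pfd_avg = lam_du * T / 2    # single device, no redundancy
    rrf = 1 / pfd_avg           # risk reduction factor
    print(f"{device}: PFDavg ~ {pfd_avg:.4f}, RRF ~ {rrf:.0f}")
```

A single valve at λDU = 0.02 pa gives RRF of only about 100, right at the SIL 1/SIL 2 boundary, which is why the next slides conclude that SIL 2 and SIL 3 normally need 1oo2 final elements.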
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12% – 15% in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on the application
Batch process: T < 0.1 years
– results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
– low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless the overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate actual λDU; compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
– preventing systematic failures
• Volume of operating experience provides evidence that:
– systematic capability is adequate
– inspection and maintenance is effective
– failures are monitored, analysed and understood
– failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
bull Failures are almost never purely random and as a result
bull Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable preventable or avoidable to some degree suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

– Identify and then control conditions that induce early failures

– Actively prevent common-cause failures

4. Detailed analysis of all failures to:

– Monitor failure rates

– Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
bull Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
bull Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions Redundancy also increases the likelihood of inadvertent or spurious action which in itself may lead to increased risk of hazards
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
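The narrowing of the band can be illustrated with the chi-squared upper bound on a constant failure rate. This sketch uses the Wilson-Hilferty approximation to the chi-squared quantile to stay dependency-free; the device-years figure is an arbitrary assumption and cancels out of the ratio:

```python
import math
from statistics import NormalDist

def chi2_ppf(p, df):
    """Chi-squared quantile via the Wilson-Hilferty approximation."""
    z = NormalDist().inv_cdf(p)
    return df * (1 - 2 / (9 * df) + z * math.sqrt(2 / (9 * df))) ** 3

def lambda_upper(n_failures, device_years, confidence):
    """Upper confidence bound on a constant failure rate
    (time-terminated observation, chi-squared method)."""
    return chi2_ppf(confidence, 2 * (n_failures + 1)) / (2 * device_years)

device_years = 400.0   # arbitrary assumed exposure; cancels in the ratio
for n in (3, 10, 30):
    ratio = lambda_upper(n, device_years, 0.90) / lambda_upper(n, device_years, 0.70)
    print(f"{n} failures: lambda90/lambda70 ~ {ratio:.2f}")
```

The ratio shrinks steadily as failures accumulate, confirming that it depends only on the failure count, not on the rate or the population size.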
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random but the causes are usually well known and understood and are partially deterministic The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed
Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
bull Only a small proportion of failures occur at a fixed rate that cannot be changed
bull The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance
bull The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
bull The rates of systematic failures vary widely from user to user depending on the effectiveness of the quality management practices
bull No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events These failures are largely systematic and preventable
Factors that dominate PFD
In the process sector, the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours)[8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
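These proportionalities can be checked numerically. The Python sketch below uses the typical values quoted in this paper (β = 10%, λDU = 0.02 pa, T = 1 year); the λDD and MTTR figures are purely illustrative assumptions:

```python
# Hypothetical example values; only beta, lam_du and T come from the text.
beta = 0.10          # common cause fraction (10 %)
lam_du = 0.02        # dangerous undetected failure rate, per annum
T = 1.0              # proof test interval, years
lam_dd = 0.2         # dangerous detected failure rate, per annum (assumed)
mttr = 8.0 / 8760.0  # mean time to restoration: 8 hours, in years (assumed)

# Average PFD of the 1oo2 valve subsystem, term by term:
independent_term = (1 - beta) * (lam_du * T) ** 2 / 3
common_cause_term = beta * lam_du * T / 2

# Contribution of detected failures (device unavailable during repair):
detected_term = lam_dd * mttr

print(f"independent term : {independent_term:.1e}")
print(f"common cause term: {common_cause_term:.1e}")
print(f"detected term    : {detected_term:.1e}")
```

With these numbers the common cause term is roughly eight times the independent term, which is why β dominates the PFD of voted architectures.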
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful.'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice.
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
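A minimal Python sketch of that benchmarking idea (the design rate is the typical valve figure used earlier; the population, observation period and observed failure count are invented for illustration, not taken from the report):

```python
# Compare observed failures against the number expected from the
# failure rate assumed in the design (all inputs are hypothetical).
lam_du_design = 0.02   # assumed dangerous undetected rate, per device-year
population = 40        # devices in service (assumed)
years = 5.0            # observation period (assumed)
observed = 7           # failures actually recorded (assumed)

expected = lam_du_design * population * years    # target benchmark
measured_rate = observed / (population * years)  # failures per device-year

print(f"expected failures: {expected:.1f}, observed: {observed}")
if observed > expected:
    print(f"measured rate {measured_rate:.3f}/a exceeds the design "
          f"assumption {lam_du_design}/a - analyse causes and "
          f"consider compensating measures")
```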
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions

Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition that distinguishes random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable, if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016.
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007.
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010.
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010.
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489.
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015.
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013.
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013.
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008.
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015.
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015.
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015.
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990.
A long-standing debate
Should the calculated probability of failure take into account systematic failures?
The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.
ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.
The intent of the standards
Manage the risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods
Reduce the risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
‒ Apply diagnostics for fault detection and regular periodic testing
‒ Apply redundant equipment for fault tolerance
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match
If I record 47 heads in 100 tosses, P(head) ≈ 0.47
If a football team wins 47 matches out of 100, what is the probability they will win their next match?
What is random?
Dictionary:
Made, done or happening without method or conscious decision; haphazard.
Mathematics:
A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.
These definitions are significantly different.
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.
Guidance from ISO/TR 12489
Random:
• Hardware - electronic components: constant λ
• Hardware - mechanical components: non-constant λ (age and wear related failures in the mid-life period)
• Human - operating under stress, non-routine: variable λ
Systematic - cannot be quantified by a fixed rate:
• Hardware - specification, design, installation, operation
• Software - specification, coding, testing, modification
• Human - depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9, catalectic failure: sudden and complete failure.
Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5, random hardware failure: failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
Not limited to 'catalectic' failure
Such failures cannot be characterised by a fixed, constant failure rate, though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood. But many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but with wide variation between different operators
• not a fixed, constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed:

PFD(t) = ∫₀ᵗ λDU e^(−λDU τ) dτ = 1 − e^(−λDU t)
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:
PFD ≈ λDU t
With a proof test interval T, the average is simply:
PFDavg ≈ λDU T / 2
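These approximations are easy to verify numerically; a small Python check using the slide's values (λDU = 0.02 pa, T = 1 year):

```python
import math

lam_du = 0.02  # dangerous undetected failure rate, per annum
T = 1.0        # proof test interval, years

# Exact probability that an undetected failure has occurred by time T:
pfd_exact = 1 - math.exp(-lam_du * T)
# Linear approximation, valid while PFD < 0.1:
pfd_linear = lam_du * T
# Average over the proof test interval:
pfd_avg = lam_du * T / 2

print(f"exact: {pfd_exact:.4f}, linear: {pfd_linear:.4f}, average: {pfd_avg:.4f}")
```

At λDU T = 0.02 the exponential and linear forms differ by about 1%, which is negligible given order-of-magnitude failure rates.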
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.
With 1oo2 valves the PFD is approximately:
PFDavg ≈ (1 − β)(λDU T)²/3 + β λDU T/2
β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates the PFD.
The PFD depends equally on λDU, β and T.
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.
The failure rate:
• can be measured
• may be predictable
• is not a fixed, constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically, failure rates vary over at least an order of magnitude. The scope for reduction in failures is clear: a reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but are sufficient to show the feasibility of PFD and RRF targets (risk targets can only be in orders of magnitude).
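As a quick cross-check, the quoted rates are simply the reciprocals of the MTBF figures (order-of-magnitude values only):

```python
# lambda_DU ~ 1 / MTBF for the typical device types quoted above
for device, mtbf_years in [("sensor", 300),
                           ("actuated valve assembly", 50),
                           ("contactor or relay", 200)]:
    lam_du = 1.0 / mtbf_years
    print(f"{device}: lambda_DU ~ {lam_du:.3f} per annum")
```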
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12% - 15% in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates the PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on the application
Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless the overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
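These claims follow directly from the PFD formulas given earlier. A hypothetical Python sketch (the SIL bands follow the standard RRF decades; the input values are the typical figures above):

```python
def sil_band(rrf):
    """Map a risk reduction factor onto the standard SIL bands."""
    if rrf >= 10000:
        return "SIL 4"
    if rrf >= 1000:
        return "SIL 3"
    if rrf >= 100:
        return "SIL 2"
    if rrf >= 10:
        return "SIL 1"
    return "below SIL 1"

def pfd_1oo1(lam_du, T):
    # Average PFD of a single final element, proof tested every T years
    return lam_du * T / 2

def pfd_1oo2(lam_du, T, beta):
    # Average PFD of a redundant pair, including the common cause term
    return (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2

lam_du, T, beta = 0.02, 1.0, 0.10  # typical values from the text
for name, pfd in [("1oo1", pfd_1oo1(lam_du, T)),
                  ("1oo2", pfd_1oo2(lam_du, T, beta))]:
    print(f"{name}: PFDavg = {pfd:.1e}, RRF = {1 / pfd:.0f} -> {sil_band(1 / pfd)}")
```

With these typical values even 1oo2 gives an RRF of only about 900, just short of SIL 3, which is why β, λDU and/or T must be reduced to reach a risk reduction of 1000.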
Design assumptions set performance benchmarks
The probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design, and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate the actual λDU and compare it with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Use root cause analysis; prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance are effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
PREVENT PREVENTABLE FAILURES
M Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and on failure rates being fixed and constant.
Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, the operating environment and the maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
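These two relationships can be sketched numerically. The following is a minimal illustration assuming the constant-rate exponential model described above (the function and variable names are illustrative, not taken from any standard):

```python
import math

def pfd_instantaneous(lam_du: float, t: float) -> float:
    """Probability that a component has failed dangerously and undetected
    by time t since its last proof test, for a constant rate lam_du."""
    return 1.0 - math.exp(-lam_du * t)

def pfd_average(lam_du: float, test_interval: float, steps: int = 10_000) -> float:
    """Numerical time-average of PFD(t) over one proof-test interval."""
    dt = test_interval / steps
    total = sum(pfd_instantaneous(lam_du, (i + 0.5) * dt) for i in range(steps))
    return total * dt / test_interval

def pfd_series(*component_pfds: float) -> float:
    """Combine components that must all work: for small PFDs the
    system PFD is approximately the sum of the component PFDs."""
    return sum(component_pfds)

lam_du = 0.02   # dangerous undetected failures per annum (example rate)
T = 1.0         # proof-test interval, years
print(pfd_average(lam_du, T))            # close to lam_du * T / 2 = 0.01
print(pfd_series(0.005, 0.001, 0.004))   # e.g. sensor + logic + final element
```

Because λDU·T ≪ 1 here, the averaged exponential reduces to the familiar λDU·T/2 approximation.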
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculating failure rate and probability. For this reason, the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
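A sketch of this estimate, assuming n failures observed over a cumulative operating time and one common convention for time-censored data, λa = χ²(a; 2n+2)/(2·Ttotal). The Wilson-Hilferty approximation below stands in for a statistics library's chi-squared quantile function:

```python
import math
from statistics import NormalDist

def chi2_quantile(p: float, df: int) -> float:
    """Wilson-Hilferty approximation to the chi-squared quantile function."""
    z = NormalDist().inv_cdf(p)
    k = 2.0 / (9.0 * df)
    return df * (1.0 - k + z * math.sqrt(k)) ** 3

def rate_with_confidence(n_failures: int, total_time: float, a: float) -> float:
    """Failure rate estimated with confidence level a (time-censored data)."""
    return chi2_quantile(a, 2 * n_failures + 2) / (2.0 * total_time)

# The ratio lambda_90 / lambda_70 depends only on the number of failures,
# not on the rate itself or on the population size:
for n in (3, 10, 30):
    ratio = rate_with_confidence(n, 1.0, 0.90) / rate_with_confidence(n, 1.0, 0.70)
    print(n, round(ratio, 2))
```

With 10 recorded failures the ratio comes out around 1.2 to 1.25, matching the observation that λ90 is then no more than about 20% higher than λ70; with only 3 failures the band is noticeably wider (around 1.4).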
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed, constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by FHC Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent, if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process: degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector, the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)
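The proportionalities above can be checked numerically using the 1oo2 approximation from this section, with the example values quoted in the text (a sketch; the function name is ours):

```python
def pfd_1oo2(lam_du: float, T: float, beta: float) -> float:
    """Average PFD of a 1oo2 final-element pair:
    independent-failure term plus common cause term."""
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    return independent + common_cause

lam_du, T = 0.02, 1.0     # example failure rate and 1-year proof-test interval
for beta in (0.02, 0.05, 0.10):
    print(f"beta={beta:.0%}  PFDavg={pfd_1oo2(lam_du, T, beta):.1e}")
```

Even at β = 2% the common cause term (β·λDU·T/2 = 2.0e-4) already exceeds the independent term (about 1.3e-4), confirming that common cause failures dominate the PFD.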
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion of the different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
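This categorisation can be expressed directly. The sketch below maps an average demand-mode PFD to the SIL bands of IEC 61508/IEC 61511 (SIL n covers 10^-(n+1) ≤ PFDavg < 10^-n) and reports the RRF with a single significant figure; the function names are illustrative:

```python
import math

def sil_band(pfd_avg: float):
    """Return the SIL band (1..4) for a demand-mode average PFD,
    or None if the PFD falls outside the SIL 1..4 ranges."""
    if not (1e-5 <= pfd_avg < 1e-1):
        return None
    # SIL n corresponds to 10**-(n+1) <= PFD < 10**-n
    return math.ceil(-math.log10(pfd_avg)) - 1

def rrf_one_sig_fig(pfd_avg: float):
    """Risk reduction factor reported with one significant figure only,
    reflecting the order-of-magnitude nature of the estimate."""
    rrf = 1.0 / pfd_avg
    exponent = math.floor(math.log10(rrf))
    return round(rrf / 10 ** exponent) * 10 ** exponent

print(sil_band(0.005), rrf_one_sig_fig(0.005))     # SIL 2, RRF 200
print(sil_band(0.0007), rrf_one_sig_fig(0.0007))   # SIL 3, RRF 1000
```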
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. a 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with a 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
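To see why β must be minimised, the 1oo2 approximation can be inverted to find the largest β-factor compatible with a target PFD. This is a sketch using the example values from this paper (the helper name is ours):

```python
def max_beta_for_target(lam_du: float, T: float, pfd_target: float):
    """Largest beta for which the 1oo2 approximation
    (1-beta)*(lam*T)**2/3 + beta*lam*T/2 still meets pfd_target.
    Returns None if even beta = 0 misses the target."""
    a = (lam_du * T) ** 2 / 3.0    # PFD at beta = 0 (independent term only)
    b = lam_du * T / 2.0           # PFD at beta = 1 (all failures common cause)
    if a > pfd_target:
        return None
    # pfd(beta) = a + beta * (b - a) is linear in beta, so solve directly
    return (pfd_target - a) / (b - a)

# SIL 3 entry point: RRF of 1000, i.e. a PFD target of 1e-3
beta_max = max_beta_for_target(0.02, 1.0, 1e-3)
print(f"beta must be below about {beta_max:.0%}")
```

With λDU = 0.02 pa and a 1-year test interval, β must be kept below roughly 9%, which is demanding given that measured β values are often above 10%; hence the pressure to reduce λDU or shorten T.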
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design Evidence is also needed to show that operation inspection testing and maintenance of the systems will be effective in achieving the target failure performance
Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
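One simple way to implement such a target comparison (a sketch of the idea, not a method prescribed by the SINTEF guideline; the 5% threshold is an illustrative choice) is to treat the benchmark rate as a Poisson mean and flag observed failure counts that are improbably high:

```python
import math

def poisson_tail(k_observed: int, mean: float) -> float:
    """P(X >= k_observed) for X ~ Poisson(mean)."""
    cdf = sum(math.exp(-mean) * mean ** k / math.factorial(k)
              for k in range(k_observed))
    return 1.0 - cdf

def exceeds_benchmark(observed: int, target_rate: float,
                      n_devices: int, years: float, alpha: float = 0.05) -> bool:
    """Flag if the observed failure count lies in the upper alpha tail of
    the Poisson distribution implied by the target failure rate."""
    expected = target_rate * n_devices * years
    return poisson_tail(observed, expected) < alpha

# Example: target lambda_DU = 0.02 pa, 20 valves, 5 years of service
# -> about 2 failures expected; 6 observed failures would trigger analysis
print(exceeds_benchmark(6, 0.02, 20, 5))   # True
print(exceeds_benchmark(3, 0.02, 20, 5))   # False
```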
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions: Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable, if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail, and the effectiveness of the processes, techniques and measures, must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
The intent of the standards
Manage risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
  ‒ Apply diagnostics for fault detection and regular periodic testing
  ‒ Apply redundant equipment for fault tolerance
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match
If I record 47 heads in 100 tosses, P(head) ≈ 0.47
If a football team wins 47 matches out of 100, what is the probability they will win their next match?
What is random?
Dictionary:
'Made, done, or happening without method or conscious decision; haphazard'
Mathematics:
A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.
These definitions are significantly different.
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.
Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age and wear related failures in mid-life period)
• Human – operating under stress, non-routine: variable λ
Systematic – cannot be quantified by a fixed rate:
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9: catalectic failure – sudden and complete failure
Note 1 to entry: [...] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure –
'failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate, though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a PFD of 0.1 means a 10% chance that a device picked at random has failed:
PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t) ≈ λDU·t
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is linear:
PFD ≈ λDU·t
With a proof test interval T the average is simply:
PFDavg ≈ λDU·T/2
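The linear approximation above can be checked numerically. A minimal sketch in Python (the rate and interval are illustrative values, not data for any particular device):

```python
import math

def pfd_exact(lam_du: float, t: float) -> float:
    """Exact probability of failure on demand at time t for a constant
    dangerous undetected failure rate (exponential accumulation)."""
    return 1.0 - math.exp(-lam_du * t)

def pfd_linear(lam_du: float, t: float) -> float:
    """Linear approximation PFD ~ lam_du * t, valid while lam_du * t << 1."""
    return lam_du * t

lam_du = 0.02  # illustrative: 0.02 dangerous undetected failures per annum
T = 1.0        # proof-test interval in years

print(pfd_exact(lam_du, T))   # exact value, close to the linear value 0.02
print(lam_du * T / 2)         # PFDavg over the test interval, here 0.01
```

At PFD below 0.1 the exact and linear values differ by only a few percent, which is why the simple PFDavg ≈ λDU·T/2 rule is adequate for order-of-magnitude work.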
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.
With 1oo2 valves the PFD is approximately:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
β is the fraction of failures that share common cause.
Common cause failure always strongly dominates PFD. The PFD depends equally on λDU, β and T.
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.
The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude, so the scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be in orders of magnitude).
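These order-of-magnitude figures can be turned into a quick feasibility check. A sketch in Python, using the illustrative MTBF values above and the simple 1oo1 approximation PFDavg ≈ λDU·T/2:

```python
# Illustrative typical rates from the text (not vendor or certified data).
typical_lambda_du = {
    "sensor": 1 / 300,           # MTBF ~ 300 y  ->  lam_du ~ 0.003 pa
    "actuated valve": 1 / 50,    # MTBF ~ 50 y   ->  lam_du ~ 0.02 pa
    "contactor/relay": 1 / 200,  # MTBF ~ 200 y  ->  lam_du ~ 0.005 pa
}

T = 1.0  # proof-test interval in years (illustrative)

for device, lam_du in typical_lambda_du.items():
    pfd_avg = lam_du * T / 2   # 1oo1 architecture, no diagnostics
    rrf = 1 / pfd_avg          # risk reduction factor achieved
    print(f"{device}: PFDavg ~ {pfd_avg:.4f}, RRF ~ {rrf:.0f}")
```

A single valve yields RRF ≈ 100, right on the SIL 1 / SIL 2 boundary, which is why SIL 2 with a 1oo1 valve is marginal while SIL 1 is comfortably feasible.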
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12% - 15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application
Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ Low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate actual λDU; compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways.
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
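The dependence of the λ90/λ70 ratio on the number of recorded failures can be illustrated with a short calculation. This sketch uses the standard χ² estimator λa = χ²a(2n+2)/(2T) for n failures in cumulative time T, with the Wilson-Hilferty approximation to the χ² quantile so that only the Python standard library is needed (the cumulative time is an arbitrary illustrative value; it cancels in the ratio):

```python
from statistics import NormalDist

def chi2_quantile(p: float, k: float) -> float:
    """Wilson-Hilferty approximation to the chi-squared p-quantile
    with k degrees of freedom (accurate enough for this illustration)."""
    z = NormalDist().inv_cdf(p)
    return k * (1 - 2 / (9 * k) + z * (2 / (9 * k)) ** 0.5) ** 3

def lambda_conf(confidence: float, n_failures: int, cumulative_time: float) -> float:
    """Failure rate estimated at the given one-sided confidence level,
    from n_failures observed over cumulative_time device-years."""
    return chi2_quantile(confidence, 2 * n_failures + 2) / (2 * cumulative_time)

# The ratio lambda90/lambda70 depends only on the number of failures recorded:
for n in (3, 10, 30):
    ratio = lambda_conf(0.90, n, 1000.0) / lambda_conf(0.70, n, 1000.0)
    print(f"{n} failures: lambda90/lambda70 ~ {ratio:.2f}")
```

With 3 failures the ratio is roughly 1.4; by 10 failures it has fallen to around 1.2, consistent with the observation that λ90 is then only some 20% higher than λ70.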
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:
'The simplest example of a stationary process, where in discrete time all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent, if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are then characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures that share a common cause and will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
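The dominance of the common cause term is easy to verify numerically. A sketch evaluating the two terms of the approximation with the values quoted above (β = 10%, λDU = 0.02 pa, T = 1 y, all illustrative):

```python
def pfd_avg_1oo2(beta: float, lam_du: float, T: float):
    """Return the two terms of the 1oo2 approximation
    PFDavg ~ (1-beta)*(lam_du*T)**2/3 + beta*lam_du*T/2."""
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    return independent, common_cause

independent, common_cause = pfd_avg_1oo2(0.10, 0.02, 1.0)
print(f"{independent:.1e}")   # 1.2e-04 (independent failures)
print(f"{common_cause:.1e}")  # 1.0e-03 (common cause term)
print(f"{independent + common_cause:.1e}")
```

The common cause term is almost ten times larger than the independent term, so reducing β is at least as important as reducing λDU or T.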
Clearly the PFD for undetected failures is directly proportional to β λDU and T
For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
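The benchmarking idea described above can be sketched numerically. This is our illustrative sketch of the approach, not code from the SINTEF report: the population size, follow-up period and failure rate are assumed figures, and the Poisson tail gives a simple way to judge whether an observed failure count is consistent with the design assumption.

```python
import math

def expected_failures(lam_du, population, years):
    """Target (expected) number of failures over the follow-up period."""
    return lam_du * population * years

def prob_at_least(k, mu):
    """P(N >= k) for N ~ Poisson(mu): how surprising k observed failures are."""
    return 1.0 - sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))

# Example: 50 valves, assumed lam_DU = 0.02 /yr, reviewed after 2 years.
mu = expected_failures(0.02, 50, 2.0)   # 2.0 dangerous failures expected
p = prob_at_least(7, mu)                # ~0.005: observing 7 failures is a
                                        # strong signal that the real rate
                                        # exceeds the design assumption
```

A small tail probability indicates that root-cause analysis and compensating measures are needed, not merely more frequent testing.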
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
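The point is easy to quantify. In this illustrative sketch (our numbers, using the typical sensor rate quoted elsewhere in the paper) the expected failure count over a realistic observation window is well below one, so the most likely observation is zero failures and no rate can be estimated from the data alone.

```python
import math

# Illustrative: 20 sensors, assumed lam_DU = 0.003 /yr, observed for 5 years.
lam_du, population, years = 0.003, 20, 5.0
mu = lam_du * population * years   # 0.3 expected failures in the whole window
p_zero = math.exp(-mu)             # ~0.74: the most likely outcome is that no
                                   # failures are seen at all
```

With a three-in-four chance of seeing no failures, leading indicators from inspection and condition monitoring are the only practical source of evidence.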
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost; the equipment is not designed to facilitate access for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.
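The proportionality between out-of-service time and PFD is worth making concrete. In this sketch (our illustrative numbers), the unavailability contributed by bypass or repair time is simply the fraction of the year the function is out of service:

```python
# Unavailability added by time out of service (bypass or repair).
HOURS_PER_YEAR = 8760.0

def bypass_unavailability(hours_out_per_year):
    """PFD contribution of out-of-service time: fraction of time unavailable."""
    return hours_out_per_year / HOURS_PER_YEAR

# A single 24 h outage per year adds ~0.0027 to the PFD: already more than a
# quarter of a SIL 2 budget of 0.01, which is why MTTR must be kept short
# and why accessibility has to be designed in.
```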
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to the management of 'random' failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly, and with fixed and constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude; the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated, and if inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It also includes the achievement of an appropriate level of systematic capability.
The level of attention to detail, and the effectiveness of the processes, techniques and measures, must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References

[1] 'Functional safety - Safety instrumented systems for the process industry sector - Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016.
[2] 'Equipment reliability testing - Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007.
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 4: Definitions and abbreviations', IEC 61508-4:2010.
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010.
[5] 'Petroleum, petrochemical and natural gas industries - Reliability modelling and calculation of safety systems', ISO/TR 12489:2013.
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015.
[7] 'Reliability Prediction Method for Safety Instrumented Systems - PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013.
[8] 'Reliability Prediction Method for Safety Instrumented Systems - PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013.
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008.
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015.
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015.
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015.
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990.
What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match
If I record 47 heads in 100 tosses, P(head) ≈ 0.47.
If a football team wins 47 matches out of 100, what is the probability they will win their next match?
What is random?
Dictionary: made, done or happening without method or conscious decision; haphazard.
Mathematics: a purely random process involves mutually independent events. The probability of any one event is not dependent on other events.
These definitions are significantly different.
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.
Guidance from ISO/TR 12489
Random:
• Hardware - electronic components: constant λ
• Hardware - mechanical components: non-constant λ (age and wear related failures in mid-life period)
• Human - operating under stress, non-routine: variable λ
Systematic - cannot be quantified by a fixed rate:
• Hardware - specification, design, installation, operation
• Software - specification, coding, testing, modification
• Human - depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9: catalectic failure - sudden and complete failure.
Note 1 to entry: [...] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure - failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
Random hardware failure is not limited to 'catalectic' failure. It cannot be characterised by a fixed constant failure rate, though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially (e.g. PFD = 0.1 means a 10% chance that a device picked at random has failed):

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t)
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:
PFD ≈ λDU·t
With a proof test interval T, the average is simply:
PFDavg ≈ λDU·T/2
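How good is the linear approximation? A quick numerical check (our sketch, using the illustrative valve figures from the slides) compares the exact time-average of the exponential accumulation with λDU·T/2:

```python
import math

lam_du, T = 0.02, 1.0   # illustrative: valve lam_DU in /yr, annual proof test

# Exact time-average of PFD(t) = 1 - exp(-lam*t) over one proof-test interval:
pfd_exact = 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)
pfd_linear = lam_du * T / 2.0

# With lam_DU * T = 0.02 the two agree to well under 1%, so the linear form
# is comfortably adequate whenever PFDavg < 0.1.
```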
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements. With 1oo2 valves the PFD is approximately:

PFDavg ≈ ((1 − β)·λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates the PFD.
The PFD depends equally on λDU, β and T.
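The dominance of the common-cause term can be seen by splitting the approximation into its two parts. This sketch uses the slides' typical figures (β = 10%, λDU = 0.02 pa, T = 1 y); the function name is ours.

```python
def pfd_1oo2_terms(lam_du, T, beta):
    """Independent and common-cause terms of the 1oo2 PFDavg approximation."""
    independent = ((1.0 - beta) * lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    return independent, common_cause

ind, ccf = pfd_1oo2_terms(0.02, 1.0, 0.10)
# ind ~ 1.1e-4, ccf ~ 1.0e-3: the common-cause term is roughly ten times
# larger, so beta matters as much as lam_DU and T.
```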
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude, so the scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum.
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum.
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum.
These are order-of-magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be in orders of magnitude).
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12%-15% in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on the application:
Batch process: T < 0.1 years
- results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
- low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
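These claims can be checked numerically. This is our illustrative sketch, using the 1oo1 and 1oo2 approximations quoted in the slides with the typical values β = 10%, λDU = 0.02 pa, T = 1 y:

```python
def pfd_1oo1(lam, T):
    return lam * T / 2.0

def pfd_1oo2(lam, T, beta):
    return ((1.0 - beta) * lam * T) ** 2 / 3.0 + beta * lam * T / 2.0

lam, T, beta = 0.02, 1.0, 0.10
rrf_1oo1 = 1.0 / pfd_1oo1(lam, T)        # ~100: only the bottom of SIL 2
rrf_1oo2 = 1.0 / pfd_1oo2(lam, T, beta)  # ~900: still short of the 1000
                                         # needed for SIL 3
# So SIL 3 needs 1oo2 *and* a reduction in lam_DU, beta and/or T.
```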
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• calculate actual λDU and compare with the design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
- preventing systematic failures
• volume of operating experience provides evidence that:
- systematic capability is adequate
- inspection and maintenance is effective
- failures are monitored, analysed and understood
- failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data
Performance targets must be feasible by design.
Credible reliability data ≡ feasible performance targets.
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland #18312), Engineering Manager; A. Hertel, Principal Engineer
IampE Systems Pty Ltd
Perth Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and failure rates being fixed and constant.
Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant.
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.
2. Designing the equipment to allow access to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.
3. Using risk-based inspection and condition-based maintenance techniques to:
- identify and then control conditions that induce early failures
- actively prevent common-cause failures
4. Detailed analysis of all failures to:
- monitor failure rates
- determine how failure recurrence may be prevented, if failure rates need to be reduced.
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFDs of the individual components.
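For a SIF in which every subsystem must work (sensor, logic solver and final element, each 1oo1), the combination reduces, to first order, to a simple sum of subsystem PFDs. This sketch uses the paper's typical rates with an annual proof test; the logic-solver value is an assumed placeholder, not a figure from the paper.

```python
# First-order series combination: subsystem PFDs simply add.
subsystem_pfd = {
    "sensor": 0.003 / 2,   # lam_DU = 0.003 /yr, annual proof test
    "logic": 1e-4,         # assumed placeholder, typically small
    "valve": 0.02 / 2,     # lam_DU = 0.02 /yr, annual proof test
}
pfd_sif = sum(subsystem_pfd.values())           # ~0.0116
valve_share = subsystem_pfd["valve"] / pfd_sif  # ~0.86
# The final element contributes ~86% of the total, illustrating why SIF PFD
# is dominated by the final elements in the process sector.
```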
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself, or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the confidence is further increased: λ90 will be no more than about 20% higher than λ70.
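This behaviour can be reproduced numerically. The sketch below is ours, not from the paper: it uses the standard time-truncated bound λa = χ²a(2n + 2)/(2t), with the Wilson-Hilferty approximation for the χ² quantile so that only the Python standard library is needed (scipy.stats.chi2.ppf would give the exact values).

```python
from statistics import NormalDist

def chi2_ppf(p, k):
    """Wilson-Hilferty approximation to the chi-squared quantile (k d.o.f.)."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * k)
    return k * (1.0 - a + z * a ** 0.5) ** 3

def lam_bound(n_failures, total_unit_years, conf):
    """One-sided upper bound on lam from n failures in total_unit_years
    (time-truncated data): lam_a = chi2_a(2n + 2) / (2 * t)."""
    return chi2_ppf(conf, 2 * n_failures + 2) / (2.0 * total_unit_years)

# The ratio lam90/lam70 depends only on the failure count, not on the rate
# or the observation time:
r10 = lam_bound(10, 1000.0, 0.90) / lam_bound(10, 1000.0, 0.70)  # ~1.24
r3 = lam_bound(3, 1000.0, 0.90) / lam_bound(3, 1000.0, 0.70)     # ~1.4
```

With 10 recorded failures the ratio is roughly 1.2, consistent with the statement above; with only 3 failures the band is noticeably wider.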
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant-mortality failures and end-of-life failures.
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1-out-of-2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)(λDU T)²/3 + β λDU T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless:
β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
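The dominance of the common cause term is easy to verify numerically. The sketch below is illustrative only (the function name is mine); it evaluates the two terms of the approximation with the typical values quoted in the text.

```python
def pfd_1oo2_terms(lam_du: float, T: float, beta: float) -> tuple[float, float]:
    """Return the two terms of PFDavg ~ (1-beta)*(lam_du*T)**2/3 + beta*lam_du*T/2:
    the independent-failure term and the common cause term."""
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    return independent, common_cause


# Typical values from the text: lam_du = 0.02 per annum, T = 1 year, beta = 10%
ind, ccf = pfd_1oo2_terms(0.02, 1.0, 0.10)
print(f"independent: {ind:.1e}  common cause: {ccf:.1e}")  # common cause ~8x larger
```

Even at β = 2% the common cause term remains comparable to the independent term, which is why the text calls such a low β difficult to exploit.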
Clearly the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion of the different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficient to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This results in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.
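The MTTR contribution can be made explicit by extending the usual single-element approximation. This sketch is illustrative (the function name and the example λDD value are my assumptions, not from the paper): undetected failures contribute λDU·T/2, while detected failures leave the function unavailable for the repair or bypass time.

```python
HOURS_PER_YEAR = 8760.0


def pfd_1oo1_avg(lam_du: float, T: float, lam_dd: float, mttr_hours: float) -> float:
    """Average PFD of a single final element: undetected dangerous failures
    accumulate between proof tests (lam_du*T/2); detected dangerous failures
    leave the function unavailable for the repair time (lam_dd * MTTR).
    Rates are per year, T in years, MTTR in hours."""
    return lam_du * T / 2 + lam_dd * mttr_hours / HOURS_PER_YEAR


# lam_du = 0.02/yr, annual proof test, assumed lam_dd = 0.04/yr
print(f"MTTR  72 h: PFD ~ {pfd_1oo1_avg(0.02, 1.0, 0.04, 72.0):.1e}")
print(f"MTTR 720 h: PFD ~ {pfd_1oo1_avg(0.02, 1.0, 0.04, 720.0):.1e}")
```

A tenfold increase in restoration time visibly erodes the PFD budget, which is the quantitative case for designing in accessibility.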
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] 'OREDA Offshore and Onshore Reliability Data Handbook', Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] 'Safety Equipment Reliability Handbook', 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
What is random?
Dictionary:
'Made, done, or happening without method or conscious decision; haphazard'
Mathematics:
A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.
These definitions are significantly different.
Guidance from ISO/TR 12489
ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.
Guidance from ISO/TR 12489
Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age and wear related failures in the mid-life period)
• Human – operating under stress, non-routine: variable λ
Systematic – cannot be quantified by a fixed rate:
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9: catalectic failure – sudden and complete failure.
Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure – failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood. But many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially (e.g. a 10% chance that a device picked at random has failed):
PFD(t) = ∫₀ᵗ λDU e^(−λDU τ) dτ = 1 − e^(−λDU t)
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:
PFD ≈ λDU t
With a proof test interval T, the average is simply:
PFDavg ≈ λDU T/2
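The linearisation is easy to check numerically. A minimal sketch (function name is mine) compares the exact time-average of 1 − e^(−λt) over a proof test interval with the λDU·T/2 approximation:

```python
import math


def pfd_avg_exact(lam: float, T: float) -> float:
    """Exact average of 1 - exp(-lam*t) over the proof test interval [0, T]."""
    return 1.0 - (1.0 - math.exp(-lam * T)) / (lam * T)


lam_du, T = 0.02, 1.0  # 0.02 undetected dangerous failures/yr, annual proof test
print(pfd_avg_exact(lam_du, T))  # exact average: ~0.00993
print(lam_du * T / 2)            # linear approximation: 0.01
```

For λDU·T ≪ 1 the two agree to well within 1%, which is why the simple λDU·T/2 formula is adequate at the precision argued for in this paper.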
PFD for 1oo2 final elements
In the process sector SIF PFD is dominated by the PFD of the final elements.
With 1oo2 valves the PFD is approximately:
PFDavg ≈ (1 − β)(λDU T)²/3 + β λDU T/2
β is the fraction of failures that share a common cause.
PFD for 1oo2 final elements
In the process sector SIF PFD is dominated by the PFD of the final elements.
With 1oo2 valves the PFD is approximately:
PFDavg ≈ (1 − β)(λDU T)²/3 + β λDU T/2
β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates PFD.
The PFD depends equally on λDU, β and T.
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum.
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum.
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum.
These are order-of-magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be in orders of magnitude).
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12%–15% in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application:
• Batch process: T < 0.1 years ‒ results in low PFD; final elements might not dominate
• Continuous process: T ≈ 1 year
• LNG production: T > 4 years ‒ low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements unless overall λDU can be reduced to < 0.02 pa (MTBF DU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• λDU, β and/or T must be minimised
• needs close attention in design
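The mapping from typical values to a feasible SIL band can be sketched as follows (RRF = 1/PFD; SIL n corresponds to RRF ≥ 10ⁿ; all figures are assumed, typical values):

```python
import math

# Assumed typical values: β = 10%, λDU = 0.02 pa, T = 1 y
beta, lam_du, T = 0.10, 0.02, 1.0

pfd = {
    "1oo1": lam_du * T / 2,  # single final element
    "1oo2": (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2,
}

sil = {}
for arch, p in pfd.items():
    rrf = 1.0 / p
    # SIL n is feasible when RRF >= 10^n (order-of-magnitude bands)
    sil[arch] = min(4, int(math.floor(math.log10(rrf) + 1e-9)))
    print(f"{arch}: PFD ≈ {p:.0e}, RRF ≈ {rrf:.0f}, feasible SIL {sil[arch]}")

# With β ≈ 10%, even 1oo2 stays in the SIL 2 band: common cause failure
# holds the RRF below 1000, so SIL 3 needs β, λDU and/or T to be reduced.
```

The result illustrates the point above: with β around 10% the 1oo2 pair is still held in the SIL 2 band by common cause failure.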
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate actual λDU and compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment ‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design
Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.29 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components
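As a sketch of that combination (the subsystem PFD values are assumed, chosen only to illustrate the typical dominance of the final element):

```python
# System PFD approximated as the sum of independent subsystem PFDs
# (a common simplification, valid while each PFD is small).
pfd_sensors = 1.5e-3   # e.g. voted transmitters (assumed)
pfd_logic   = 1.0e-4   # logic solver (assumed)
pfd_final   = 1.0e-2   # single actuated valve: λDU·T/2 with λDU = 0.02/y, T = 1 y

pfd_system = pfd_sensors + pfd_logic + pfd_final
share_final = pfd_final / pfd_system

print(f"system PFD ≈ {pfd_system:.1e}")
print(f"final element share ≈ {share_final:.0%}")   # the valve dominates
```

With these figures the final element contributes well over 80% of the system PFD, which anticipates the discussion of final elements later in the paper.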
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
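This behaviour can be reproduced numerically. The sketch below estimates the upper confidence bound λa = χ²a(2n+2)/(2t) for a time-terminated observation of n failures over t unit-years, using the Wilson–Hilferty approximation to the χ² quantile so that no statistics library is needed; the observation data are assumed.

```python
import math

def chi2_quantile(z, k):
    """Wilson-Hilferty approximation to the chi-squared quantile with
    k degrees of freedom; z is the standard normal quantile."""
    h = 2.0 / (9.0 * k)
    return k * (1.0 - h + z * math.sqrt(h)) ** 3

Z70, Z90 = 0.5244, 1.2816   # standard normal quantiles for 70% and 90%

def lam_bound(n_failures, unit_years, z):
    """Upper confidence bound on the failure rate after n failures."""
    k = 2 * n_failures + 2
    return chi2_quantile(z, k) / (2.0 * unit_years)

# Assumed data: 10 failures observed across 5000 unit-years of service
l70 = lam_bound(10, 5000.0, Z70)
l90 = lam_bound(10, 5000.0, Z90)
print(f"λ70 ≈ {l70:.2e}/y, λ90 ≈ {l90:.2e}/y, ratio ≈ {l90 / l70:.2f}")
# With 10 failures recorded, λ90 is only ~20-25% above λ70
```

Re-running with larger failure counts shows the ratio shrinking further, consistent with the narrowing uncertainty band described above.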
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:

'…failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:

'The simplest example of a stationary process where in discrete time all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
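The break-even value of β below which common cause failure would stop dominating can be computed directly from the two terms of the approximation (a sketch using the example values above):

```python
# Setting β·λDU·T/2 equal to (1-β)(λDU·T)²/3 and solving for β:
lam_du, T = 0.02, 1.0   # example values from the text
x = lam_du * T

beta_breakeven = (2 * x / 3) / (1 + 2 * x / 3)
print(f"common cause dominates unless β < {beta_breakeven:.1%}")
# ≈ 1.3%, an order of magnitude below the β values typically achieved
```

This makes the conclusion concrete: the threshold sits around 1.3%, far below the common cause fractions reported in practice.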
The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E. P. Box:

'Essentially, all models are wrong, but some are useful'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
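A minimal version of such a target check might look like this (all numbers are assumed; the alert rule is a coarse two-standard-deviation test on the Poisson mean, not taken from the SINTEF guideline):

```python
# Compare observed failures against the count expected from design assumptions.
lam_design = 0.02    # failure rate assumed in design, per device-year (assumed)
n_devices = 40       # population of similar devices (assumed)
years = 3.0          # observation period

expected = lam_design * n_devices * years      # Poisson mean ≈ 2.4 failures
threshold = expected + 2 * expected ** 0.5     # coarse 2-sigma alert level

observed = 7                                   # from maintenance records (assumed)
print(f"expected ≈ {expected:.1f}, threshold ≈ {threshold:.1f}, observed = {observed}")
if observed > threshold:
    print("failure rate exceeds benchmark: analyse causes, consider compensating measures")
```

In this assumed scenario the observed count clearly exceeds the benchmark, which is the trigger for root cause analysis rather than simply testing more often.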
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
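As noted earlier, for detected failures the PFD contribution is proportional to λDD and the MTTR, so the value of accessibility can be sketched numerically (both figures are assumed for illustration):

```python
# PFD contribution of detected dangerous failures ≈ λDD × MTTR
lam_dd = 0.05         # detected dangerous failure rate, per annum (assumed)
mttr_hours = 72.0     # mean time to restoration, including bypass time (assumed)

pfd_detected = lam_dd * (mttr_hours / 8760.0)   # MTTR converted to years
print(f"PFD contribution while awaiting repair ≈ {pfd_detected:.1e}")
# Halving the MTTR halves this contribution: a direct payoff of access for repair
```

The linear dependence is the point: every hour shaved off the restoration time reduces the unavailability of the function in direct proportion.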
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing lsquorandomrsquo failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions

Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude; the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction: SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction; they also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries — Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Guidance from ISO/TR 12489

ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6. It distinguishes:

Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age and wear related failures in the mid-life period)
• Human – operating under stress, non-routine: variable λ

Systematic (cannot be quantified by a fixed rate):
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude
Purely random failure

Only 'catalectic' failures have constant failure rates. ISO/TR 12489 §3.2.9 defines catalectic failure as sudden and complete failure:

Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508

IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure: failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.

Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Random hardware failure is not limited to 'catalectic' failure. It cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.
Random or systematic failures?

Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)

Most failures are partially systematic and partially random.
Quasi-random hardware failures

Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate

If undetected device failures occur at a constant average rate, then failures accumulate exponentially (e.g. after some time in service there may be a 10 % chance that a device picked at random has failed):

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t)
The basic assumption in failure probability

If undetected device failures occur at a constant average rate, then failures accumulate exponentially. SIFs always require PFDavg < 0.1, and in this region the accumulation is approximately linear:

PFD ≈ λDU·t

With a proof test interval T, the average is simply:

PFDavg ≈ λDU·T / 2
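These two approximations can be checked numerically. The sketch below uses illustrative values only (λDU = 0.02 per annum, T = 1 year) and compares the exact exponential expression with the linear and averaged approximations:

```python
import math

LAMBDA_DU = 0.02   # dangerous undetected failure rate, per year (illustrative)
T = 1.0            # proof test interval, years

# Exact accumulation of undetected failures: PFD(t) = 1 - exp(-lambda_DU * t)
pfd_exact = 1 - math.exp(-LAMBDA_DU * T)

# Linear approximation, valid while the PFD stays well below 0.1
pfd_linear = LAMBDA_DU * T

# Average over the proof test interval: PFDavg ~ lambda_DU * T / 2
pfd_avg = LAMBDA_DU * T / 2

print(pfd_exact)   # 0.0198...
print(pfd_linear)  # 0.02
print(pfd_avg)     # 0.01
```

At λDU·T = 0.02 the linear form overstates the exact PFD by only about 1 %, which is negligible against the order-of-magnitude uncertainty in λDU itself.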
PFD for 1oo2 final elements

In the process sector, SIF PFD is dominated by the PFD of the final elements. With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)² / 3 + β·λDU·T / 2

where β is the fraction of failures that share a common cause. Common cause failure always strongly dominates the PFD, and the PFD depends equally on λDU, β and T.
Dependence on failure rate

Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures

OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failure is clear: a reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU

Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum

Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum

Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum

These are order of magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be in orders of magnitude).
Typical feasible values for β

IEC 61508-6 Annex D suggests a range of 1 % to 10 %. Typically β ≈ 12 % – 15 % in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5 % with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application

Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.
What architecture is typically needed?

For typical values β ≈ 10 %, λDU ≈ 0.02 pa, T ≈ 1 y:

SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBF_DU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
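As a rough numerical sketch of these claims (values illustrative; the SIL bands are the demand-mode PFDavg ranges of IEC 61511), a 1oo1 final element at the typical values sits right at the SIL 1/SIL 2 boundary, while 1oo2 reaches SIL 2 but not SIL 3:

```python
# Typical values quoted above: beta = 10 %, lambda_DU = 0.02 /y, T = 1 y
BETA, LAMBDA_DU, T = 0.10, 0.02, 1.0

pfd_1oo1 = LAMBDA_DU * T / 2
pfd_1oo2 = (1 - BETA) * (LAMBDA_DU * T) ** 2 / 3 + BETA * LAMBDA_DU * T / 2

def sil_band(pfd):
    """Map a demand-mode PFDavg to its SIL band (IEC 61511)."""
    if 1e-2 <= pfd < 1e-1:
        return 1
    if 1e-3 <= pfd < 1e-2:
        return 2
    if 1e-4 <= pfd < 1e-3:
        return 3
    return None  # outside the SIL 1-3 bands

print(pfd_1oo1, sil_band(pfd_1oo1))  # 0.01 -> SIL 1 (boundary case)
print(pfd_1oo2, sil_band(pfd_1oo2))  # ~0.00112 -> SIL 2
```

The 1oo2 result lands an order of magnitude above the SIL 3 band, which is why λDU, β and/or T must be reduced to reach SIL 3.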
Design assumptions set performance benchmarks

Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance. Achieving these performance standards depends on the design, and on operation and maintenance practices.
Monitoring performance is essential

Operators must monitor the failure rates and modes:
• Calculate actual λDU, compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
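A minimal sketch of this monitoring check, with hypothetical numbers throughout (the failure count, fleet exposure and benchmark are made up for illustration):

```python
# Hypothetical operating records for a fleet of similar devices
failures_found = 3       # dangerous undetected failures revealed by proof tests
device_years = 120.0     # number of devices x years in service
benchmark = 0.02         # lambda_DU assumed in the PFD calculation, per year

measured_rate = failures_found / device_years  # 0.025 per year

if measured_rate > benchmark:
    print("lambda_DU exceeds the design benchmark: "
          "root cause analysis and remedial action needed")
else:
    print("lambda_DU is within the design benchmark")
```

In this hypothetical case the measured rate (0.025 pa) exceeds the assumed 0.02 pa, so the PFD calculation's benchmark is not being met in operation.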
Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components

Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature, and failure rates being fixed and constant.

Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.
2. Designing the equipment to allow access to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.
3. Using risk-based inspection and condition-based maintenance techniques to:
   - Identify, and then control, conditions that induce early failures
   - Actively prevent common-cause failures
4. Detailed analysis of all failures to:
   - Monitor failure rates
   - Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently, at a fixed rate λDU, within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
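For instance, when the sensor, logic solver and final-element subsystems fail independently, the overall PFDavg is (to first order) simply the sum of the subsystem values. The numbers below are illustrative only:

```python
# Illustrative subsystem PFDavg values for a demand-mode SIF
pfd_sensor = 0.0015
pfd_logic = 0.0001
pfd_final = 0.0100   # final elements usually dominate in the process sector

# Exact series combination is 1 - product(1 - PFD_i); for small PFDs the
# first-order approximation is simply the sum
pfd_sif = pfd_sensor + pfd_logic + pfd_final
rrf = 1 / pfd_sif    # risk reduction factor achieved

print(pfd_sif)       # ~0.0116
print(round(rrf))    # 86
```

Even in this toy example the final elements contribute the bulk of the overall PFD, which is taken up again in the discussion of 1oo2 final elements below.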
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90 % ('Route 2H')

A confidence level of 90 % effectively means that there is only a 10 % chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70 %. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded; it does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20 % higher than λ70.
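This χ² estimate can be sketched in a few lines. For n recorded failures over a cumulative exposure time T, the one-sided upper bound with confidence a is λa = χ²(a; 2n+2) / (2T). The code below is an illustration under stated assumptions: it uses the Wilson-Hilferty approximation for the χ² quantile so that only the Python standard library is needed (in practice `scipy.stats.chi2.ppf` or standard tables would be used):

```python
from statistics import NormalDist

def chi2_quantile(p, k):
    """Wilson-Hilferty approximation to the chi-squared quantile (k dof)."""
    z = NormalDist().inv_cdf(p)
    return k * (1 - 2 / (9 * k) + z * (2 / (9 * k)) ** 0.5) ** 3

def lambda_upper(n_failures, total_time, confidence):
    """One-sided upper-bound failure rate from n failures in total_time."""
    dof = 2 * n_failures + 2
    return chi2_quantile(confidence, dof) / (2 * total_time)

# The lambda_90 / lambda_70 ratio depends only on the failure count:
# the exposure time cancels out of the ratio
exposure = 1.0e6  # arbitrary units
for n in (3, 10):
    ratio = lambda_upper(n, exposure, 0.90) / lambda_upper(n, exposure, 0.70)
    print(n, round(ratio, 2))
```

With 10 failures the ratio comes out at roughly 1.2, consistent with the 'about 20 %' figure above; with only 3 failures it is roughly 1.4.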
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90 % uncertainty interval for the reported failure rates; this is the band stretching from the 5 % certainty level to the 95 % certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent, if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process; degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed constant factor, representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example consider a safety function using actuated valves as final elements in a 1oo2 (ie 1 out of 2) redundant architecture A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 002 failures per annum (approximately 1 failure in 50 years or around 23 failures per 106 hours) [8] [11] [12] The average PFD of the valve subsystem may be approximated by
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
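To see how strongly the common cause term dominates, the short sketch below (our illustration, not taken from the references) evaluates the 1oo2 approximation with the example values λDU = 0.02 pa and T = 1 year for a few values of β:

```python
# Illustrative evaluation of the 1oo2 PFD approximation quoted above.
# The numeric values are the paper's example figures, not universal data.

def pfd_avg_1oo2(lambda_du: float, t_interval: float, beta: float) -> float:
    """Average PFD of a 1oo2 pair: independent term plus common cause term."""
    independent = (1.0 - beta) * (lambda_du * t_interval) ** 2 / 3.0
    common_cause = beta * lambda_du * t_interval / 2.0
    return independent + common_cause

lambda_du = 0.02   # dangerous undetected failures per annum
t_interval = 1.0   # proof test interval, years

for beta in (0.10, 0.02, 0.01):
    ccf = beta * lambda_du * t_interval / 2.0
    total = pfd_avg_1oo2(lambda_du, t_interval, beta)
    print(f"beta = {beta:.0%}: PFDavg = {total:.1e}, CCF share = {ccf / total:.0%}")
```

Even at β = 2% the common cause term still contributes more than half of the total, which illustrates why β would need to fall well below that level before the independent term dominates.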
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E. P. Box:
'Essentially, all models are wrong, but some are useful.'
The quote is in the context of a discussion of different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
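As an illustration of that categorisation, the hypothetical helper below maps an order-of-magnitude PFD estimate onto the low-demand SIL bands; the band boundaries are those of IEC 61511, but the function itself is our sketch:

```python
# Hypothetical helper: map an average PFD (low-demand mode) to its SIL band.
# Band boundaries follow IEC 61511; one significant figure of input is enough.

def sil_band(pfd_avg: float) -> str:
    if 1e-2 <= pfd_avg < 1e-1:
        return "SIL 1"
    if 1e-3 <= pfd_avg < 1e-2:
        return "SIL 2"
    if 1e-4 <= pfd_avg < 1e-3:
        return "SIL 3"
    if 1e-5 <= pfd_avg < 1e-4:
        return "SIL 4"
    return "outside SIL 1-4 bands"

print(sil_band(2e-3))  # an estimate good to one significant figure -> SIL 2
```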
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction, it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
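One simple way to set and check such a target value is to compare the observed failure count with the count expected from the design failure rate. The sketch below is our own; treating the failures as a Poisson process is an assumption for illustration, not a method prescribed by the report:

```python
# Sketch: compare observed failures against the count expected from the
# design failure rate, treating failures as a Poisson process.
import math

def expected_failures(lambda_du: float, n_devices: int, years: float) -> float:
    """Expected number of failures for a fleet over an observation period."""
    return lambda_du * n_devices * years

def prob_at_least(k: int, mean: float) -> float:
    """P(N >= k) for a Poisson-distributed count with the given mean."""
    p_less = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - p_less

mean = expected_failures(0.02, 40, 5.0)   # e.g. 40 valves over 5 years -> 4.0
observed = 9                              # hypothetical maintenance-record count
p = prob_at_least(observed, mean)
if p < 0.05:
    print(f"{observed} failures vs {mean:.1f} expected (p = {p:.3f}): "
          f"analyse causes and consider compensating measures")
```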
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
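That proportionality can be made concrete with a small sketch of our own; the 72-hour figure below is an arbitrary example, not a value from the paper:

```python
# Sketch: the PFD contribution of time spent bypassed or under repair is
# simply the unavailable fraction of operating time.

HOURS_PER_YEAR = 8760.0

def pfd_out_of_service(hours_unavailable_per_year: float) -> float:
    """Average unavailability contributed by bypass/repair time."""
    return hours_unavailable_per_year / HOURS_PER_YEAR

# e.g. 72 h of bypass per year contributes about 8e-3 to the PFD --
# enough on its own to consume most of a SIL 2 budget (PFDavg < 1e-2).
print(f"{pfd_out_of_service(72.0):.1e}")
```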
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the 'random' failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries — Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F. H. C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Guidance from ISO/TR 12489
Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age and wear related failures in mid-life period)
• Human – operating under stress, non-routine: variable λ
Systematic (cannot be quantified by a fixed rate):
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude
Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9: catalectic failure – sudden and complete failure
Note 1 to entry: [...] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure – failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
Not limited to 'catalectic' failure
Cannot be characterised by a fixed, constant failure rate – though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood. But many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed, constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially:
PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t)
e.g. at λDU·t = 0.1 there is a 10% chance that a device picked at random has failed
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is linear:
PFD ≈ λDU·t
With a proof test interval T, the average is simply:
PFDavg ≈ λDU·T/2
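A quick numeric check (our sketch, using the example rate λDU = 0.02 pa) confirms that the exponential and linear forms agree in this region:

```python
# Sketch: for lambda_du * t << 1 the exponential accumulation is
# effectively linear, and the average over a test interval is lambda*T/2.
import math

lambda_du = 0.02   # dangerous undetected failures per annum
t_interval = 1.0   # proof test interval, years

pfd_end_exact = 1.0 - math.exp(-lambda_du * t_interval)  # exact, end of interval
pfd_end_linear = lambda_du * t_interval                  # linear approximation
pfd_avg = lambda_du * t_interval / 2.0                   # interval average

print(f"end of interval: exact {pfd_end_exact:.4f} vs linear {pfd_end_linear:.4f}")
print(f"average over interval: {pfd_avg:.3f}")
```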
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.
With 1oo2 valves the PFD is approximately:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
β is the fraction of failures that share common cause.
Common cause failure always strongly dominates PFD.
The PFD depends equally on λDU, β and T.
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.
The failure rate:
• can be measured
• may be predictable
• is not a fixed, constant rate
Typical failure rates are easy to obtain.
Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically, failure rates vary over at least an order of magnitude. The scope for reduction in failure is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be in orders of magnitude).
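Those order-of-magnitude figures are enough for a rough feasibility screen, sketched below; the dictionary holds the typical MTBF-derived rates quoted above, while the helper itself is our own construction:

```python
# Rough feasibility screen using the typical order-of-magnitude rates above.

TYPICAL_LAMBDA_DU = {        # dangerous undetected failures per annum
    "sensor": 0.003,
    "actuated_valve": 0.02,
    "contactor_or_relay": 0.005,
}

def pfd_avg_1oo1(lambda_du: float, t_interval: float) -> float:
    """Average PFD of a single (1oo1) element with proof test interval T."""
    return lambda_du * t_interval / 2.0

# Is SIL 2 (PFDavg < 1e-2) feasible with a single valve tested yearly?
pfd = pfd_avg_1oo1(TYPICAL_LAMBDA_DU["actuated_valve"], 1.0)
print(f"1oo1 valve, T = 1 y: PFDavg = {pfd:.0e} -> right at the SIL 2 boundary")
```

The result lands exactly on the SIL 2 boundary, which matches the paper's observation that SIL 2 with a single valve is feasible but marginal.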
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12%–15% in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application
Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
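These claims can be checked against the 1oo2 approximation given earlier; the sketch below is our illustration using β = 10%, λDU = 0.02 pa and T = 1 y:

```python
# Illustrative check of the architecture guidance above.

def pfd_avg_1oo2(lambda_du: float, t_interval: float, beta: float) -> float:
    """Average PFD of a 1oo2 pair, including the common cause term."""
    return ((1.0 - beta) * (lambda_du * t_interval) ** 2 / 3.0
            + beta * lambda_du * t_interval / 2.0)

# With typical values, 1oo2 just misses the SIL 3 band (PFDavg < 1e-3)...
print(f"T = 1.0 y: PFDavg = {pfd_avg_1oo2(0.02, 1.0, 0.10):.2e}")
# ...so lambda_du, beta and/or T must be reduced, e.g. halving T:
print(f"T = 0.5 y: PFDavg = {pfd_avg_1oo2(0.02, 0.5, 0.10):.2e}")
```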
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate actual λDU; compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland #18312), Engineering Manager
A. Hertel, Principal Engineer
IampE Systems Pty Ltd
Perth Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring increased confidence levels, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.
Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.
2. Designing the equipment to allow access to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (ie a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (eg Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
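These two steps can be sketched numerically. The following is a minimal illustration assuming the exponential failure model described above; the function names are illustrative, not from any standard, and the series combination uses the usual rare-event approximation (summing small PFDs):

```python
import math

def pfd_component(lam_du: float, t: float) -> float:
    """Probability that a component has failed undetected by time t,
    assuming a fixed, constant dangerous-undetected failure rate lam_du."""
    return 1.0 - math.exp(-lam_du * t)

def pfd_series(pfds: list) -> float:
    """Combine independent component PFDs for a system that fails if any
    one component has failed (rare-event approximation: sum of PFDs)."""
    return sum(pfds)

# Example: lam_du = 0.02 failures per annum, 1 year after the last proof test
p = pfd_component(0.02, 1.0)     # close to the linear approximation lam_du * t
system = pfd_series([p, p])      # two independent components in series
```

For small λt the exponential term is close to λt, which is why the simple linear and averaged approximations used later in the paper work well in the region where PFD is small.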
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (ie redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded; it does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
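This narrowing can be checked numerically. The sketch below is illustrative, not taken from the paper: it uses the common χ² upper-bound formula λa = χ²(a, 2n+2) / (2·T_total) for n failures observed in a cumulative operating time T_total, with a pure-stdlib Wilson-Hilferty approximation to the χ² quantile (in practice one would use scipy.stats.chi2.ppf):

```python
from statistics import NormalDist

def chi2_ppf(p: float, k: int) -> float:
    """Wilson-Hilferty approximation to the chi-squared quantile (good to ~1%)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * k)
    return k * (1.0 - c + z * c ** 0.5) ** 3

def failure_rate_upper(n_failures: int, total_hours: float, conf: float) -> float:
    """Upper confidence bound on the failure rate after n failures in total_hours."""
    return chi2_ppf(conf, 2 * n_failures + 2) / (2.0 * total_hours)

# The ratio of the 90% to the 70% estimate depends only on the failure count:
for n in (3, 10, 30):
    ratio = failure_rate_upper(n, 1e6, 0.90) / failure_rate_upper(n, 1e6, 0.70)
    print(n, round(ratio, 2))  # the band narrows as failures accumulate
```

Note that the total operating time cancels out of the ratio, which is why the width of the confidence band depends only on the number of recorded failures.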
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed, constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by FHC Marriott[13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process; degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic, and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (ie 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours)[8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β) (λDU T)² / 3 + β λDU T / 2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
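Plugging representative numbers into the 1oo2 formula above makes the dominance of the common cause term concrete. This is an illustrative calculation using the example values from the text (λDU = 0.02 pa, T = 1 year, β = 12%), not measured data:

```python
def pfd_1oo2_terms(lam_du: float, t_proof: float, beta: float):
    """Return the two contributions to PFDavg for a 1oo2 pair:
    independent term (1-beta)*(lam*T)^2/3 and common cause term beta*lam*T/2."""
    independent = (1.0 - beta) * (lam_du * t_proof) ** 2 / 3.0
    common_cause = beta * lam_du * t_proof / 2.0
    return independent, common_cause

indep, ccf = pfd_1oo2_terms(0.02, 1.0, 0.12)
# ccf (1.2e-3) is roughly ten times larger than indep (~1.2e-4),
# so the redundant pair's PFD is set almost entirely by beta
```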
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
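These proportionalities can be combined into a rough PFD budget for a single final element. The sketch below simply adds the undetected and detected contributions; it is an approximation under the assumption that the two contributions are independent, and the numbers are illustrative:

```python
def pfd_budget(lam_du: float, t_proof: float, lam_dd: float, mttr_years: float) -> float:
    """Approximate average PFD of a 1oo1 element:
    undetected failures accrue over the proof-test interval,
    detected failures leave the element unavailable for the repair time."""
    return lam_du * t_proof / 2.0 + lam_dd * mttr_years

# 0.02 pa undetected with a 1-year proof test; 0.05 pa detected,
# with about one week (7/365 of a year) to restore after detection
total = pfd_budget(0.02, 1.0, 0.05, 7.0 / 365.0)
```

The detected-failure term shows why a short MTTR matters: a week of repair time adds roughly 10% to the undetected-failure contribution in this example.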
Wrong but useful
The SINTEF Report A26922 includes a pertinent quote from George EP Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice.
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
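Since the only robust use of the calculated PFD is to place the function in a SIL band, the categorisation itself is trivial to automate. A small illustrative helper, using the standard demand-mode PFD ranges (SIL 1: RRF 10-100, SIL 2: 100-1,000, SIL 3: 1,000-10,000):

```python
import math

def sil_band(pfd_avg: float) -> int:
    """Map an average PFD to its demand-mode SIL band (0 = below SIL 1)."""
    if pfd_avg >= 0.1:
        return 0
    rrf = 1.0 / pfd_avg
    # epsilon guards against floating-point values landing just below a decade boundary
    return min(4, int(math.log10(rrf) + 1e-9))
```

Reporting the PFD itself to one significant figure, together with the band, is consistent with the order-of-magnitude precision argued for above.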
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (ie 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
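The feasibility argument can be made concrete by comparing the candidate architectures against the SIL 2 and SIL 3 targets. The sketch below uses the same illustrative rates as the text (λDU = 0.02 pa, T = 1 year) with a typical β of 12% and a reduced β of 5%; the values are examples, not recommendations:

```python
def pfd_1oo1(lam_du: float, t: float) -> float:
    """Average PFD of a single (1oo1) final element over proof-test interval t."""
    return lam_du * t / 2.0

def pfd_1oo2(lam_du: float, t: float, beta: float) -> float:
    """Average PFD of a 1oo2 pair, including the common cause contribution."""
    return (1.0 - beta) * (lam_du * t) ** 2 / 3.0 + beta * lam_du * t / 2.0

# SIL 2 requires PFDavg < 1e-2; SIL 3 requires PFDavg < 1e-3
valve_1oo1 = pfd_1oo1(0.02, 1.0)                 # 0.01: right on the SIL 2 boundary
valve_1oo2 = pfd_1oo2(0.02, 1.0, 0.12)           # ~1.3e-3: misses the SIL 3 target
valve_1oo2_low_beta = pfd_1oo2(0.02, 1.0, 0.05)  # ~6.3e-4: SIL 3 becomes feasible
```

The comparison shows the point made above: with typical β, even a 1oo2 valve pair falls short of SIL 3 unless independence is improved or the proof-test interval is shortened.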
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
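One way to operationalise this follow-up is a simple Poisson check of the observed failure count against the count expected from the design failure rate. This sketch is in the spirit of that guidance but is not a procedure taken from the SINTEF report; the threshold and example numbers are illustrative:

```python
import math

def poisson_sf(k: int, mean: float) -> float:
    """P(N >= k) for a Poisson-distributed failure count with the given mean."""
    return 1.0 - sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))

def rate_exceeds_target(observed: int, n_devices: int, years: float,
                        target_rate_pa: float, alpha: float = 0.05) -> bool:
    """Flag when the observed failure count is improbably high for the target rate."""
    expected = target_rate_pa * n_devices * years
    return poisson_sf(observed, expected) < alpha

# 50 valves over 2 years with a design target of 0.02 failures pa -> 2 expected failures
flag = rate_exceeds_target(7, 50, 2.0, 0.02)  # 7 failures: analyse the causes
ok = rate_exceeds_target(3, 50, 2.0, 0.02)    # 3 failures: within expectation
```

A flag here is a trigger for root cause analysis and compensating measures, consistent with the guidance above that increased testing alone is not a sufficient response.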
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates for some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions

Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude; the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable, if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction: SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures, and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S Hauge and M A Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S Hauge et al, 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol 1, 6th ed, SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed, exida.com LLC, Sellersville, PA, 2015
[13] FHC Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990
Definition in IEC 61511 and IEC 61508IEC 61511 sect3259 and IEC 61508-4 sect365random hardware failurefailure occurring at a random time which results from one or more of the possible degradation mechanisms in the hardwareNote 1 to entry There are many degradation mechanisms occurring at different rates in different components and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation failures of a total equipment comprising many components occur at predictable rates but at unpredictable (ie random) times
Not limited to 'catalectic' failure
Cannot be characterised by a fixed constant failure rate, though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed:

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t)
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is linear:

PFD ≈ λDU·t

With a proof test interval T, the average is simply PFDavg ≈ λDU·T/2.
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.
With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates PFD.
The PFD depends equally on λDU, β and T.
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.
The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice.
Typically failure rates vary over at least an order of magnitude, so the scope for reduction in failures is clear: a reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but are sufficient to show the feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12% to 15% in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application
Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• calculate actual λDU and compare with the design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
‒ preventing systematic failures
• volume of operating experience provides evidence that
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3.)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager, and A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature, and failure rates being fixed and constant.
Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant.
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
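The exponential accumulation and its linear approximation can be compared numerically. The sketch below is illustrative only; the λDU value is the typical actuated-valve figure quoted later in this paper, not measured data:

```python
import math

LAMBDA_DU = 0.02  # assumed dangerous undetected failure rate, per year
T = 1.0           # proof test interval, years

def pfd(t, lam=LAMBDA_DU):
    # Probability that a device chosen at random has failed by time t,
    # assuming failures accumulate exponentially at a constant rate
    return 1.0 - math.exp(-lam * t)

# Exact time-average of PFD(t) over one proof test interval [0, T]
pfd_avg_exact = 1.0 - (1.0 - math.exp(-LAMBDA_DU * T)) / (LAMBDA_DU * T)

# Linear approximation valid while PFDavg < 0.1: PFDavg ≈ λDU·T/2
pfd_avg_linear = LAMBDA_DU * T / 2.0
```

For λDU·T = 0.02 the two averages differ by well under 1%, far inside the order-of-magnitude precision that is achievable for PFD estimates.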
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
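This relationship can be sketched numerically. The example below uses the Wilson-Hilferty approximation to the χ² quantile so that no external statistics library is needed; the z-values are the standard normal quantiles for the two confidence levels:

```python
import math

# Standard normal quantiles for the two confidence levels of interest
Z = {0.70: 0.5244, 0.90: 1.2816}

def chi2_quantile(p, k):
    # Wilson-Hilferty approximation to the chi-squared quantile with
    # k degrees of freedom (accurate to a fraction of a percent here)
    c = 2.0 / (9.0 * k)
    return k * (1.0 - c + Z[p] * math.sqrt(c)) ** 3

def upper_rate_ratio(n_failures):
    # Ratio λ90/λ70 after n recorded failures; the cumulative observation
    # time appears in both estimates and cancels out of the ratio
    dof = 2 * n_failures + 2
    return chi2_quantile(0.90, dof) / chi2_quantile(0.70, dof)
```

With 3 recorded failures the ratio is roughly 1.4; with 10 or more it falls to about 1.2-1.25, consistent with the figure above.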
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard.'
ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1-out-of-2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours)[8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless β ≪ λDU·T.
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
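Plugging illustrative numbers (β = 10%, λDU = 0.02 pa, T = 1 year) into this approximation shows how strongly the common cause term dominates:

```python
BETA = 0.10       # common cause fraction (illustrative)
LAMBDA_DU = 0.02  # dangerous undetected failures per year (illustrative)
T = 1.0           # proof test interval, years

# PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
independent_term = (1.0 - BETA) * (LAMBDA_DU * T) ** 2 / 3.0
common_cause_term = BETA * LAMBDA_DU * T / 2.0
pfd_avg = independent_term + common_cause_term
```

The common cause term (0.001) is roughly eight times the independent term (0.00012), so the subsystem PFD is close to β·λDU·T/2 alone.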
The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
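A small illustration of the detected-failure contribution (all values hypothetical):

```python
LAMBDA_DD = 0.05       # detected dangerous failures per year (hypothetical)
MTTR_HOURS = 72.0      # mean time to restoration (hypothetical)
HOURS_PER_YEAR = 8760.0

# While a detected failure is being repaired the function is unavailable,
# so its PFD contribution is the fraction of time spent in repair
pfd_detected = LAMBDA_DD * MTTR_HOURS / HOURS_PER_YEAR
```

With these values the contribution is about 4 × 10⁻⁴, which is why a long MTTR (or a long time spent bypassed) can erode the risk reduction of an otherwise well-designed function.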
Wrong but useful
The SINTEF Report A26922 includes a pertinent quote from George E. P. Box:
'Essentially, all models are wrong, but some are useful.'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice.
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
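A quick feasibility check of the 1oo1 case, using the illustrative λDU of 0.02 pa (SIL 2 requires PFDavg below 10⁻²):

```python
SIL2_PFD_LIMIT = 1e-2  # SIL 2 band: 1e-3 <= PFDavg < 1e-2

def pfd_avg_1oo1(lam_du, T):
    # Simple single-channel approximation: PFDavg ≈ λDU·T/2
    return lam_du * T / 2.0

# λDU = 0.02 pa with a 1-year proof test interval sits exactly on the
# SIL 2 boundary, i.e. marginal; reduce λDU or T, or move to 1oo2
marginal = pfd_avg_1oo1(0.02, 1.0)
improved = pfd_avg_1oo1(0.02, 0.5)  # effect of halving the test interval
```

This mirrors the argument in the text: with typical valve failure rates, 1oo1 SIL 2 is only marginally feasible unless λDU or T is reduced.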
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
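This follow-up comparison can be sketched as a simple screening check. The threshold factor and all values below are hypothetical, and a real assessment would apply a proper statistical test to the failure counts:

```python
def expected_failures(lam_design, n_devices, years):
    # Expected number of failures if the design failure rate assumption holds
    return lam_design * n_devices * years

def needs_follow_up(observed, lam_design, n_devices, years, factor=2.0):
    # Flag for root cause analysis when the observed count is well above
    # the design expectation; 'factor' is an arbitrary screening margin
    return observed > factor * expected_failures(lam_design, n_devices, years)
```

For example, 50 valves observed for 5 years at an assumed λDU of 0.02 pa give an expectation of 5 failures; 11 recorded failures would trigger analysis, while 6 would not.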
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates for some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
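As a rough sketch of why MTTR matters: repair or bypass time adds directly to the average unavailability. The formula below is the common single-channel approximation PFDavg ≈ λDU(T/2 + MTTR), with illustrative numbers assumed:

```python
# Sketch: MTTR contribution to PFDavg for a single (1oo1) safety function.
# PFDavg ≈ lambda_DU * (T/2 + MTTR); all figures below are illustrative.
def pfd_avg(lambda_du: float, proof_test_interval: float, mttr: float) -> float:
    """Average probability of failure on demand, including repair/bypass time."""
    return lambda_du * (proof_test_interval / 2.0 + mttr)

lam = 0.02                                 # dangerous undetected failures per year (assumed)
T = 1.0                                    # annual proof test
fast_repair = pfd_avg(lam, T, 1.0 / 52)    # restored within a week
slow_repair = pfd_avg(lam, T, 4.0 / 52)    # out of service for four weeks
# the longer the function is out of service, the higher the PFD
```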
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5:
random hardware failure: failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.
Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.
Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood. But many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially,
e.g. 10% chance that a device picked at random has failed:
PFD(t) = ∫₀ᵗ λDU e^(−λDU·τ) dτ = 1 − e^(−λDU·t)
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:
PFD ≈ λDU·t
With a proof-test interval T, the average is simply PFDavg ≈ λDU·T / 2
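The linearisation can be checked numerically. This generic sketch (not from the paper) averages the exact exponential expression over one proof-test interval and compares it with λDU·T/2:

```python
import math

# Numerical check that PFDavg ≈ lambda_DU * T / 2 when PFD stays well below 0.1.
def pfd_avg_exact(lam: float, T: float, steps: int = 100_000) -> float:
    """Average of PFD(t) = 1 - exp(-lam*t) over one proof-test interval T."""
    return sum(1 - math.exp(-lam * (i + 0.5) * T / steps) for i in range(steps)) / steps

lam, T = 0.02, 1.0             # lambda_DU = 0.02 per annum, annual proof test (assumed)
approx = lam * T / 2           # 0.01
exact = pfd_avg_exact(lam, T)  # within about 1% of the approximation
```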
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.
With 1oo2 valves the PFD is approximately:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
β is the fraction of failures that share a common cause.
The common cause term β·λDU·T/2 always strongly dominates the PFD. The PFD depends equally on λDU, β and T.
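A small sketch of the 1oo2 expression above shows how strongly the common cause term dominates for typical values (β = 10%, λDU = 0.02 pa, T = 1 y, all taken as assumptions from the slides):

```python
# 1oo2 PFD: independent term plus common cause term (beta model).
def pfd_1oo2(lam_du: float, T: float, beta: float) -> float:
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    return independent + common_cause

lam, T, beta = 0.02, 1.0, 0.10
total = pfd_1oo2(lam, T, beta)
ccf_share = (beta * lam * T / 2) / total   # ~0.9: common cause dominates
```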
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failure is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years, λDU ≈ 0.003 per annum.
Actuated valve assemblies typically have MTBF ≈ 50 years, λDU ≈ 0.02 per annum.
Contactors or relays typically have MTBF ≈ 200 years, λDU ≈ 0.005 per annum.
These are order of magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets
(risk targets can only be in orders of magnitude).
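These MTBF figures translate into order-of-magnitude RRF estimates. The sketch below assumes a 1oo1 architecture with an annual proof test (an illustrative assumption, not from the slides):

```python
# Feasibility check: convert typical MTBF figures into lambda_DU and RRF.
def rrf_1oo1(mtbf_years: float, T: float = 1.0) -> float:
    """RRF = 1 / PFDavg for a single device, with PFDavg ≈ lambda_DU * T / 2."""
    lam_du = 1.0 / mtbf_years
    return 2.0 / (lam_du * T)

for device, mtbf in [("sensor", 300), ("actuated valve", 50), ("contactor", 200)]:
    print(f"{device}: lambda_DU ≈ {1/mtbf:.3f} pa, RRF ≈ {rrf_1oo1(mtbf):.0f}")
```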
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12% – 15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application
Batch process: T < 0.1 years
– results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
– low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
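These architecture conclusions can be reproduced in a few lines using the slide's typical values and the standard PFDavg bands (SIL 1: < 0.1, SIL 2: < 0.01, SIL 3: < 0.001); an illustrative sketch:

```python
# Which SIL band do typical final-element architectures reach?
def sil_band(pfd: float) -> int:
    """SIL achieved for a given PFDavg (0 means below SIL 1)."""
    for sil, limit in ((4, 1e-4), (3, 1e-3), (2, 1e-2), (1, 1e-1)):
        if pfd < limit:
            return sil
    return 0

lam, T, beta = 0.02, 1.0, 0.10   # typical values from the slide
pfd_single = lam * T / 2                                         # 1oo1 valve
pfd_pair = (1 - beta) * (lam * T) ** 2 / 3 + beta * lam * T / 2  # 1oo2 valves
# a single typical valve reaches only SIL 1; a 1oo2 pair reaches SIL 2
```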
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• calculate actual λDU, compare with design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
– preventing systematic failures
• volume of operating experience provides evidence that:
– systematic capability is adequate
– inspection and maintenance is effective
– failures are monitored, analysed and understood
– failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature, and failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices
2. Designing the equipment to allow access to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3. Using risk-based inspection and condition-based maintenance techniques to:
– identify and then control conditions that induce early failures
– actively prevent common-cause failures
4. Detailed analysis of all failures to:
– monitor failure rates
– determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways.
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 3
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components
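For independent components, combining the individual PFDs reduces to elementary probability. A minimal sketch, with illustrative PFD values assumed:

```python
# System PFD for components in series: the function fails if any component fails.
def system_pfd(component_pfds):
    """1 - product(1 - PFD_i): probability that at least one component has failed."""
    p_all_ok = 1.0
    for pfd in component_pfds:
        p_all_ok *= (1.0 - pfd)
    return 1.0 - p_all_ok

# sensor, logic solver, final element (illustrative PFDs)
total = system_pfd([0.0017, 0.0001, 0.01])
# for small PFDs this is close to the simple sum, 0.0118
```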
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
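This behaviour can be checked with the χ² quantiles. The sketch below uses the Wilson-Hilferty approximation so only the standard library is needed, and assumes the common time-truncated upper-bound formula λa = χ²(a, 2(n+1)) / (2t); treat it as an illustration rather than the paper's exact procedure:

```python
from statistics import NormalDist

def chi2_ppf(p: float, k: int) -> float:
    """Wilson-Hilferty approximation to the chi-squared quantile, k degrees of freedom."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * k)
    return k * (1.0 - c + z * c ** 0.5) ** 3

def lambda_conf(n_failures: int, time_hours: float, conf: float) -> float:
    """Upper confidence bound on failure rate from n failures in cumulative time."""
    return chi2_ppf(conf, 2 * (n_failures + 1)) / (2.0 * time_hours)

# The ratio lambda_90 / lambda_70 depends only on the number of failures:
ratio_3 = lambda_conf(3, 1e6, 0.90) / lambda_conf(3, 1e6, 0.70)    # ~1.4
ratio_10 = lambda_conf(10, 1e6, 0.90) / lambda_conf(10, 1e6, 0.70) # ~1.24
```

Note how the uncertainty band narrows as more failures are recorded: the ratio falls from roughly 1.4 at 3 failures towards 1.2 at 10 or more.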
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
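The dominance of the common cause term can be checked with a few lines of arithmetic. The following sketch (the function name is illustrative) evaluates the two terms of the approximation using the example values quoted above:

```python
# Sketch: the two terms of the 1oo2 PFD approximation, using the example
# values from the text (lambda_DU = 0.02 per year, T = 1 year, beta = 10%).

def pfd_terms_1oo2(lambda_du, T, beta):
    """Independent and common-cause contributions to the 1oo2 average PFD."""
    independent = (1 - beta) * (lambda_du * T) ** 2 / 3
    common_cause = beta * lambda_du * T / 2
    return independent, common_cause

ind, ccf = pfd_terms_1oo2(lambda_du=0.02, T=1.0, beta=0.10)
print(f"independent: {ind:.2e}, common cause: {ccf:.2e}")
# The common cause term (1.0e-3) is roughly eight times the
# independent term (1.2e-4), so it dominates the total PFD.
```

With β at 10% the common cause term governs the achievable PFD almost entirely, which is why β deserves as much design attention as λDU and T.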
The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
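These two proportionalities can be compared numerically. In the sketch below the λDD and MTTR figures are illustrative assumptions, not values from the paper:

```python
# Sketch of the two PFD contributions: undetected dangerous failures
# accumulate between proof tests (lambda_DU * T / 2), while detected
# failures contribute for the repair time only (lambda_DD * MTTR).
# The lambda_DD and MTTR values below are assumed for illustration.

HOURS_PER_YEAR = 8760

lambda_du = 0.02        # undetected dangerous failures per year
T = 1.0                 # proof test interval, years
lambda_dd = 0.05        # detected dangerous failures per year (assumed)
mttr_hours = 48         # mean time to restoration (assumed)

pfd_undetected = lambda_du * T / 2
pfd_detected = lambda_dd * (mttr_hours / HOURS_PER_YEAR)
print(pfd_undetected, round(pfd_detected, 6))
# The detected-failure contribution stays small only while MTTR is short;
# it grows linearly if restoration is delayed.
```

The comparison shows why MTTR belongs alongside β, λDU and T in the set of operational performance targets.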
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
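A simple way to frame such a comparison is to test whether the observed number of failures is plausible under the design failure rate. The sketch below uses a Poisson model; the device counts, rates and acceptance criterion are illustrative assumptions, not values prescribed by the SINTEF guidance:

```python
# Sketch: compare the observed number of failures in a population of
# devices against the number expected from the design failure rate.
# All numbers and the Poisson test are illustrative assumptions.
from math import exp

def poisson_cdf(k, mu):
    """P(N <= k) for a Poisson-distributed count with mean mu."""
    term, total = exp(-mu), exp(-mu)
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total

lambda_du = 0.02                          # assumed design rate, per device-year
devices, years = 40, 3
expected = lambda_du * devices * years    # expected failures over the period
observed = 6

# If observing this many failures would be unlikely under the design rate,
# the causes need to be analysed and compensating measures considered.
p_at_least_observed = 1 - poisson_cdf(observed - 1, expected)
print(f"expected {expected:.1f}, observed {observed}, "
      f"P(N >= {observed}) = {p_at_least_observed:.3f}")
```

A small probability here signals that the benchmark rate is being exceeded, triggering the root cause analysis described above rather than merely more frequent testing.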
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990
Random or systematic failures
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood. But many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially (e.g. a 10% chance that a device picked at random has failed):

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t)
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:
PFD ≈ λDU·t
With a proof test interval T, the average is simply:
PFDavg ≈ λDU·T/2
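The adequacy of the linear approximation is easy to confirm numerically. The sketch below compares the exact average of 1 − e^(−λt) over one proof test interval with λDU·T/2, using the typical valve figures:

```python
# Sketch: how good is PFDavg ≈ lambda_DU * T / 2? Compare with the exact
# average of 1 - exp(-lambda * t) over one proof test interval:
#   (1/T) * integral of (1 - exp(-lambda*t)) dt = 1 - (1 - exp(-lambda*T))/(lambda*T)
from math import exp

lambda_du, T = 0.02, 1.0
exact_avg = 1 - (1 - exp(-lambda_du * T)) / (lambda_du * T)
linear_avg = lambda_du * T / 2
print(f"exact {exact_avg:.5f}, linear {linear_avg:.5f}")
# For lambda*T << 1 the two agree to within about 1%, which is why the
# simple approximation is adequate for order-of-magnitude SIL work.
```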
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.
With 1oo2 valves the PFD is approximately:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
β is the fraction of failures that share common cause.
Common cause failure always strongly dominates PFD. The PFD depends equally on λDU, β and T.
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failure is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be in orders of magnitude).
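These MTBF figures can be cross-checked by converting them to the units used in reliability data handbooks. A small sketch:

```python
# Sketch: converting the quoted MTBF figures to failure rates per annum
# and per 10^6 hours (the units used in reliability data handbooks).
HOURS_PER_YEAR = 8760

rates_pa = {}
for device, mtbf_years in [("sensor", 300),
                           ("actuated valve assembly", 50),
                           ("contactor/relay", 200)]:
    rates_pa[device] = 1 / mtbf_years
    per_1e6_hours = rates_pa[device] / HOURS_PER_YEAR * 1e6
    print(f"{device}: {rates_pa[device]:.3f} pa = {per_1e6_hours:.1f} per 10^6 h")
# A 50-year MTBF corresponds to 0.02 pa, about 2.3 failures per 10^6 hours,
# matching the typical valve figure quoted earlier in the paper.
```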
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12% to 15% in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application
Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
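These architecture conclusions follow directly from the formulas already given. The sketch below evaluates the RRF for 1oo1 and 1oo2 with the typical values, and bands the result by SIL assuming the usual IEC 61511 convention that SIL n covers RRF from 10^n to 10^(n+1):

```python
# Sketch: RRF achieved with typical values (beta = 10%, lambda_DU = 0.02 pa,
# T = 1 y) for 1oo1 and 1oo2 final elements, banded by SIL.
from math import log10, floor

def rrf_1oo1(lam, T):
    return 1 / (lam * T / 2)

def rrf_1oo2(lam, T, beta):
    pfd = (1 - beta) * (lam * T) ** 2 / 3 + beta * lam * T / 2
    return 1 / pfd

def sil_band(rrf):
    # SIL n corresponds to RRF in [10^n, 10^(n+1)); epsilon guards rounding
    return min(4, floor(log10(rrf) + 1e-9))

r1 = rrf_1oo1(0.02, 1.0)
r2 = rrf_1oo2(0.02, 1.0, 0.10)
print(f"1oo1: RRF = {r1:.0f} -> SIL {sil_band(r1)}")  # RRF 100: marginal SIL 2
print(f"1oo2: RRF = {r2:.0f} -> SIL {sil_band(r2)}")  # RRF 893: SIL 2, short of SIL 3
```

The 1oo2 result sits well below the RRF of 1000 needed for SIL 3, which is why β, λDU and/or T must be reduced before SIL 3 becomes feasible.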
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate actual λDU; compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates It draws conclusions on how failure rates and failure probability can be controlled in practice
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.
Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable preventable or avoidable to some degree suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the probability of failure on demand (PFD) for safety functions and the corresponding risk reduction factor (RRF) achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.29 catalectic failure: sudden and complete failure

Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
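As a minimal sketch of these two steps, the exponential accumulation and the combination of independent components can be written in a few lines of Python. The function names and the example rate are illustrative, not taken from the paper's calculations, and the 1oo2 combination here deliberately ignores common cause (which is treated later):

```python
import math

def pfd_single(lam_du: float, t: float) -> float:
    """Probability that a component picked at random has failed undetected
    by time t, given a fixed constant rate lambda_DU (exponential law)."""
    return 1.0 - math.exp(-lam_du * t)

def pfd_1oo2_independent(lam_du: float, t: float) -> float:
    """A 1oo2 subsystem is failed only when both components have failed;
    this sketch assumes independence (no common cause)."""
    return pfd_single(lam_du, t) ** 2

lam_du = 0.02   # illustrative: 0.02 dangerous undetected failures per annum
print(pfd_single(lam_du, 1.0))           # close to the linear estimate lam*t
print(pfd_1oo2_independent(lam_du, 1.0))
```

For small λDU·t the exponential is close to λDU·t, which is why the linear approximations used later in the paper are adequate.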
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance
• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
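This dependence on the number of failures alone can be checked numerically. The sketch below uses the standard time-terminated estimator λa = χ²a(2n+2)/(2τ), with a Wilson-Hilferty approximation to the χ² quantile so that only the Python standard library is needed; the function names are illustrative:

```python
from statistics import NormalDist

def chi2_quantile(p: float, k: int) -> float:
    """Chi-squared quantile via the Wilson-Hilferty approximation,
    accurate to within a few percent for the degrees of freedom used here."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * k)
    return k * (1.0 - c + z * c ** 0.5) ** 3

def rate_at_confidence(n_failures: int, hours: float, confidence: float) -> float:
    """Failure rate estimated at the given confidence level from n failures
    recorded over the cumulative operating hours (time-terminated data)."""
    return chi2_quantile(confidence, 2 * n_failures + 2) / (2.0 * hours)

# The ratio of lambda_90 to lambda_70 depends only on the number of failures:
for n in (3, 10, 30):
    ratio = rate_at_confidence(n, 1e6, 0.90) / rate_at_confidence(n, 1e6, 0.70)
    print(n, round(ratio, 2))
```

With 3 failures the ratio is roughly 1.4; with 10 or more it drops below about 1.25, consistent with the 20% figure above.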
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process: degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
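A quick numerical check of this dominance, using the β-factor approximation above (the values are the illustrative ones from the example, not measurements):

```python
def pfd_avg_1oo2(lam_du: float, t_proof: float, beta: float) -> float:
    """Average PFD of a 1oo2 pair under the beta-factor common cause model:
    (1-beta)(lam*T)^2/3 + beta*lam*T/2."""
    independent = (1.0 - beta) * (lam_du * t_proof) ** 2 / 3.0
    common_cause = beta * lam_du * t_proof / 2.0
    return independent + common_cause

lam_du, t_proof = 0.02, 1.0      # 0.02 failures pa, 1-year proof test interval
for beta in (0.02, 0.10):
    ind = (1.0 - beta) * (lam_du * t_proof) ** 2 / 3.0
    ccf = beta * lam_du * t_proof / 2.0
    print(f"beta={beta:.2f}: independent={ind:.2e}, common cause={ccf:.2e}")
```

Even at β = 2% the common cause term already exceeds the independent term; at β = 10% it is almost ten times larger.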
The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice.
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
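That order-of-magnitude categorisation amounts to a simple lookup against the demand-mode PFD bands defined in IEC 61508/61511 (the function name is illustrative):

```python
def sil_band(pfd_avg: float) -> int:
    """IEC 61508/61511 demand-mode SIL band for a given average PFD
    (returns 0 if the value falls outside the SIL 1-4 bands)."""
    if 1e-2 <= pfd_avg < 1e-1:
        return 1
    if 1e-3 <= pfd_avg < 1e-2:
        return 2
    if 1e-4 <= pfd_avg < 1e-3:
        return 3
    if 1e-5 <= pfd_avg < 1e-4:
        return 4
    return 0

# An order-of-magnitude PFD estimate is enough to place the function in a band:
print(sil_band(4e-3))   # SIL 2 (an RRF of roughly 250)
```

Reporting more than the band and one significant figure of RRF would overstate the precision available.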
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate access for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of the sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then access for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.
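The MTTR contribution is easy to quantify: the sketch below contrasts a readily accessible final element with one that must wait for a maintenance window. The rates and repair times are hypothetical:

```python
HOURS_PER_YEAR = 8760.0

def pfd_from_repair_time(lam_dd_pa: float, mttr_hours: float) -> float:
    """Average unavailability contributed by detected failures while the
    safety function waits out of service for repair (lam_DD * MTTR)."""
    return lam_dd_pa * mttr_hours / HOURS_PER_YEAR

# Hypothetical detected failure rate of 0.1 pa:
print(pfd_from_repair_time(0.1, 8.0))     # accessible: repaired within 8 h
print(pfd_from_repair_time(0.1, 720.0))   # wait a month for a shutdown window
```

Waiting a month instead of a shift multiplies this contribution by a factor of 90, which can easily consume the PFD budget of a SIL 2 or SIL 3 function.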
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated, and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction: SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures, and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA, Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood. But many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random
Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed:

PFD(t) = ∫₀ᵗ λDU e^(−λDU·τ) dτ = 1 − e^(−λDU·t) ≈ λDU·t
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU·t

With a proof-test interval T, the average is simply PFDavg ≈ λDU·T/2
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.
With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates PFD (the β·λDU·T/2 term).
The PFD depends equally on λDU, β and T.
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically, failure rates vary over at least an order of magnitude, so the scope for reduction in failures is clear: a reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but they are sufficient to show the feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12%–15% in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on the application
Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values (β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y):
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless the overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design, and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• calculate the actual λDU and compare it with the design assumption
• analyse discrepancies between expected and actual behaviour
• use root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for the service and environment
‒ preventing systematic failures
• a volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance are effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, testing)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and on failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, the operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures, to:
- Monitor failure rates
- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD of safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
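As a numerical sanity check, the exponential accumulation can be compared with the usual linear approximation. This is a minimal sketch, assuming an illustrative λDU of 0.02 failures per annum and a 1-year proof-test interval; the function names are ours, not taken from any standard.

```python
import math

LAMBDA_DU = 0.02   # assumed undetected dangerous failure rate, per year
T = 1.0            # assumed proof-test interval, years

def pfd(t, lam=LAMBDA_DU):
    """Exact probability that a device has failed undetected by time t."""
    return 1.0 - math.exp(-lam * t)

def pfd_avg(T, lam=LAMBDA_DU):
    """Time-average of pfd(t) over one proof-test interval (exact integral)."""
    return 1.0 - (1.0 - math.exp(-lam * T)) / (lam * T)

# For lambda*T << 1 the exact results match the familiar approximations
# pfd(t) ~= lambda*t and pfd_avg ~= lambda*T/2:
print(pfd(T), LAMBDA_DU * T)          # ~0.0198 vs 0.02
print(pfd_avg(T), LAMBDA_DU * T / 2)  # ~0.0099 vs 0.01
```

For λDU·T ≪ 1 the exact integral and the linear approximation agree to within about 1%, far inside the order-of-magnitude uncertainty of the failure rates themselves.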
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFDs of the individual components.
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason, the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself, or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
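This behaviour can be sketched numerically. The fragment below is illustrative only: it assumes time-censored data with n failures over a cumulative operating time τ, uses the common 2n+2 degrees-of-freedom convention for the upper bound, and approximates the χ² quantile with the Wilson–Hilferty formula rather than an exact implementation.

```python
import math
from statistics import NormalDist

def chi2_ppf(p, dof):
    """Wilson-Hilferty approximation to the chi-squared quantile function."""
    z = NormalDist().inv_cdf(p)
    return dof * (1.0 - 2.0 / (9.0 * dof) + z * math.sqrt(2.0 / (9.0 * dof))) ** 3

def lambda_upper(n_failures, tau_hours, confidence):
    """Upper confidence bound on failure rate from n failures in tau hours."""
    return chi2_ppf(confidence, 2 * n_failures + 2) / (2.0 * tau_hours)

tau = 1.0e6  # hypothetical cumulative device-hours
for n in (3, 10, 30):
    ratio = lambda_upper(n, tau, 0.90) / lambda_upper(n, tau, 0.70)
    print(n, round(ratio, 2))  # the band narrows as failures accumulate
```

Note that τ cancels in the ratio, confirming the point in the text: λ90/λ70 depends only on the number of failures recorded.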
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F. H. C. Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and so are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent, if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are then characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in the final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1-out-of-2) redundant architecture. A typical rate of undetected dangerous failures λDU for an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
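The dominance of the common cause term is easy to demonstrate with the example values from the text (λDU = 0.02 pa, T = 1 year). This is an illustrative sketch, not a verification tool; the function name is ours.

```python
def pfd_1oo2_terms(lam_du, T, beta):
    """Return the independent and common cause terms of the 1oo2 approximation."""
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    return independent, common_cause

for beta in (0.02, 0.05, 0.10, 0.15):
    ind, ccf = pfd_1oo2_terms(0.02, 1.0, beta)
    print(f"beta={beta:.0%}  independent={ind:.1e}  common cause={ccf:.1e}")
# Even at beta = 2% the common cause term is already the larger contributor.
```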
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
Wrong but useful
The SINTEF Report A26922 includes a pertinent quote from George E. P. Box:
'Essentially, all models are wrong, but some are useful'
The quote appears in the context of a discussion of the different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report a calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. with 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This results in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and of the operations and maintenance team.
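These screening conclusions can be reproduced in a few lines. The sketch below uses the paper's typical figures and the standard demand-mode SIL bands; the helper names are ours, and the result is an order-of-magnitude screen, not a SIL verification.

```python
import math

def pfd_1oo1(lam_du, T):
    """PFDavg ~= lambda_DU * T / 2 for a single final element."""
    return lam_du * T / 2.0

def pfd_1oo2(lam_du, T, beta):
    """1oo2 approximation from the text, including the common cause term."""
    return (1.0 - beta) * (lam_du * T) ** 2 / 3.0 + beta * lam_du * T / 2.0

def sil_band(pfd_avg):
    """Demand-mode SIL band: SIL n covers 10**-(n+1) <= PFDavg < 10**-n."""
    if not (1e-5 <= pfd_avg < 1e-1):
        return None
    return -math.floor(math.log10(pfd_avg)) - 1

# Typical values from the text: valve lambda_DU ~ 0.02 pa, T ~ 1 y, beta ~ 10%
print(sil_band(pfd_1oo1(0.02, 1.0)))        # 1oo1 valve: SIL 1 (marginal for SIL 2)
print(sil_band(pfd_1oo2(0.02, 1.0, 0.10)))  # 1oo2 valves: SIL 2
print(sil_band(pfd_1oo2(0.01, 1.0, 0.05)))  # reduced lambda_DU and beta: SIL 3
```

The third case illustrates the point in the text: SIL 3 with 1oo2 valves is only reached once β and λDU (and/or T) are deliberately reduced.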
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
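One way to operationalise such a target is a simple statistical check of observed failures against the number expected from the design-assumed rate. This sketch treats failures as quasi-random (Poisson) events; the function names, alarm threshold and example population are illustrative assumptions, not taken from the SINTEF report.

```python
import math

def poisson_sf(k, mu):
    """P(X >= k) for X ~ Poisson(mu), computed from the exact pmf."""
    return 1.0 - sum(math.exp(-mu) * mu ** i / math.factorial(i) for i in range(k))

def failures_exceed_benchmark(observed, lam_du_pa, device_years, p_alarm=0.05):
    """Flag if `observed` failures are implausibly many for the benchmark rate."""
    expected = lam_du_pa * device_years
    p = poisson_sf(observed, expected)   # chance of seeing this many or more
    return p < p_alarm, expected, p

# e.g. 40 valves followed for 5 years against an assumed 0.02 failures pa
flag, expected, p = failures_exceed_benchmark(9, 0.02, 40 * 5)
print(flag, expected, round(p, 3))  # flags: 9 observed vs 4.0 expected
```

A flagged result would trigger the root cause analysis and compensating measures described above, rather than a proof-test frequency increase alone.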
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates for some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable, and it should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that the plants are designed to minimise initial construction cost The equipment is not designed to facilitate accessibility for testing or for maintenance
For example on LNG compression trains access to the equipment is often constrained Some critical final elements can only be taken out of service at intervals of 5 years or more Even if the designers have provided facilities to enable on-line condition monitoring and testing the opportunities for corrective maintenance are severely constrained by the need to maintain production Deteriorating equipment must remain in service until the next planned shutdown
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.
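The relationship can be sketched with a commonly used single-channel approximation (an assumption for illustration, not a formula stated at this point in the paper): PFDavg ≈ λDU·T/2 plus a contribution proportional to the time the function is out of service.

```python
# Minimal sketch, assuming the common approximation
# PFDavg ≈ λDU·T/2 + (λDU + λDD)·MTTR, with T and MTTR in years.

def pfd_avg(lambda_du, test_interval_y, lambda_dd=0.0, mttr_y=0.0):
    """Approximate average PFD for a single (1oo1) element.

    lambda_du -- dangerous undetected failure rate (per year)
    test_interval_y -- proof test interval T (years)
    lambda_dd -- dangerous detected failure rate (per year)
    mttr_y -- mean time to restoration (years)
    """
    return lambda_du * test_interval_y / 2 + (lambda_du + lambda_dd) * mttr_y

# A valve with λDU = 0.02 /y, tested yearly:
base = pfd_avg(0.02, 1.0)  # 0.01, i.e. RRF ≈ 100
# The same valve with detected failures restored slowly (~4 weeks):
slow = pfd_avg(0.02, 1.0, lambda_dd=0.05, mttr_y=4 / 52)  # RRF eroded
```

The example shows why MTTR belongs in the design requirements: a long restoration time erodes the risk reduction even when λDU is unchanged.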
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing ‘random’ failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions

Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable, if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures, and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] ‘Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements’, IEC 61511-1:2016.
[2] ‘Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity’, IEC 60605-6:2007.
[3] ‘Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations’, IEC 61508-4:2010.
[4] ‘Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3’, IEC 61508-6:2010.
[5] ‘Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems’, ISO/TR 12489:2013.
[6] ‘Safety Integrity Level (SIL) Verification of Safety Instrumented Functions’, ISA-TR84.00.02-2015.
[7] ‘Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook’, SINTEF, Trondheim, Norway, Report A24442, 2013.
[8] ‘Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook’, SINTEF, Trondheim, Norway, Report A24443, 2013.
[9] S. Hauge and M. A. Lundteigen, ‘Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase’, SINTEF, Trondheim, Norway, Report A8788, 2008.
[10] S. Hauge et al., ‘Common Cause Failures in Safety Instrumented Systems’, SINTEF, Trondheim, Norway, Report A26922, 2015.
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015.
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015.
[13] F. H. C. Marriott, ‘A Dictionary of Statistical Terms’, 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990.
Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
e.g. a 10% chance that a device picked at random has failed:
PFD(t) = ∫₀ᵗ λDU e^(−λDU τ) dτ = 1 − e^(−λDU t)
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:
PFD ≈ λDU t
With a proof test interval T, the average is simply:
PFDavg ≈ λDU T / 2
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.
With 1oo2 valves the PFD is approximately:
PFDavg ≈ (1 − β)(λDU T)²/3 + β λDU T/2
β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates the PFD.
The PFD depends equally on λDU, β and T.
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.
The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically, failure rates vary over at least an order of magnitude. The scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12%-15% in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on the application
Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• calculate the actual λDU and compare it with the design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
‒ preventing systematic failures
• volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland #18312), Engineering Manager; A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 ‘Route 2H’ method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires ‘credible and traceable reliability data’ in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and on failure rates being fixed and constant.
Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that the failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, the operating environment and the maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant.
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility’s maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.
3. Using risk-based inspection and condition-based maintenance techniques to:
- identify, and then control, conditions that induce early failures
- actively prevent common-cause failures
4. Detailed analysis of all failures to:
- monitor failure rates
- determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term ‘catalectic failure’ to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with catalectic verses (i.e. a verse with seven feet instead of eight), which stop abruptly. Thus a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently “as good as new” until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. the Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFDs of the individual components.
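The two steps above, exponential accumulation within a proof test interval and combination across components, can be sketched as follows. The component values are illustrative assumptions consistent with the typical rates quoted later in the paper.

```python
import math

# Sketch: a component's PFD grows as 1 - exp(-λDU·t) between proof
# tests; a series of components combines as 1 - Π(1 - PFD_i), which
# for small PFDs is approximately just the sum.

def pfd_at(lambda_du, t):
    """PFD of one component, t years after its last proof test."""
    return 1.0 - math.exp(-lambda_du * t)

def series_pfd(pfds):
    """System fails dangerously if any series element has failed."""
    product = 1.0
    for p in pfds:
        product *= 1.0 - p
    return 1.0 - product

sensor = pfd_at(0.003, 1.0)   # ≈ λDU·t while PFD « 0.1
valve = pfd_at(0.02, 1.0)
system = series_pfd([sensor, valve])  # dominated by the valve
```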
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason, the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected (‘Route 1H’)
• Increasing the confidence level in the measured failure rates to at least 90% (‘Route 2H’)
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level ‘a’ is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself, or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
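The χ² estimate described above can be sketched with only the standard library. The conventional estimator λa = χ²(a, 2(n+1)) / (2T), for n recorded failures over T cumulative device-years, always uses an even number of degrees of freedom, for which the χ² CDF has a closed form, so the quantile can be found by bisection. The device-year figure below is an illustrative assumption; note that it cancels out of the λ90/λ70 ratio.

```python
import math

# χ² failure-rate estimate: λa = χ²(a, 2(n+1)) / (2T).
# For even degrees of freedom 2m, the χ² CDF has the closed form
# F(x) = 1 - exp(-x/2) * Σ_{i<m} (x/2)^i / i!.

def chi2_cdf_even(x, dof):
    m = dof // 2
    s = sum((x / 2) ** i / math.factorial(i) for i in range(m))
    return 1.0 - math.exp(-x / 2) * s

def chi2_quantile_even(a, dof):
    lo, hi = 0.0, 10.0 * dof + 100.0
    for _ in range(200):  # bisection on the monotone CDF
        mid = (lo + hi) / 2
        if chi2_cdf_even(mid, dof) < a:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def failure_rate(n_failures, device_years, confidence):
    return chi2_quantile_even(confidence, 2 * (n_failures + 1)) / (2 * device_years)

# With 10 failures recorded, λ90 is only ~20-25% above λ70,
# regardless of the rate itself or the population size:
ratio = failure_rate(10, 500.0, 0.90) / failure_rate(10, 500.0, 0.70)
```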
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word ‘random’:
‘Made, done, or happening without method or conscious decision; haphazard’
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
‘…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware’
The mathematical analysis of failure probability is based on the concept of a ‘random process’ or ‘stochastic process’. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
‘failure rates arising from random hardware failures can be predicted with reasonable accuracy’
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from ‘A Dictionary of Statistical Terms’ by F. H. C. Marriott [13]:
‘The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as “white noise”, relating to the energy characteristics of certain physical phenomena’
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
‘…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot’
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent, if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process: degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed constant factor, representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic, and preventable.
Factors that dominate PFD
In the process sector, the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example consider a safety function using actuated valves as final elements in a 1oo2 (ie 1 out of 2) redundant architecture A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 002 failures per annum (approximately 1 failure in 50 years or around 23 failures per 106 hours) [8] [11] [12] The average PFD of the valve subsystem may be approximated by
PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
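As a rough numerical check (not from the paper), the two terms of the 1oo2 approximation can be compared directly. The sketch below uses the example values λDU = 0.02 pa and T = 1 year, and a few illustrative β values:

```python
# Sketch: compare the independent term and the common cause term of the
# 1oo2 approximation PFDavg ≈ (1-β)(λDU·T)²/3 + β·λDU·T/2.
def pfd_1oo2(lam_du, T, beta):
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    return independent, common_cause

lam_du = 0.02   # undetected dangerous failures per annum (example value)
T = 1.0         # proof test interval, years

for beta in (0.02, 0.05, 0.10):
    ind, ccf = pfd_1oo2(lam_du, T, beta)
    print(f"beta={beta:.0%}: independent={ind:.1e}, common cause={ccf:.1e}")
```

Even at β = 2% the common cause term is already larger than the independent term, consistent with the text.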
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful.'
The quote is in the context of a discussion of the different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant; the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture) using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This results in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
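The link between these targets and the achievable SIL band can be sketched numerically. The helper below is illustrative, not from the paper: it uses the 1oo2 approximation quoted earlier, adds a λDD·MTTR term for detected failures, and maps the result onto the standard low-demand SIL bands (SIL n: 10⁻⁽ⁿ⁺¹⁾ ≤ PFD < 10⁻ⁿ):

```python
# Sketch: estimate the PFD of a 1oo2 final element subsystem from the
# operational targets (β, λDU, T, and optionally λDD and MTTR), and report
# the SIL band it falls in. All parameter values are illustrative.
import math

def pfd_avg_1oo2(lam_du, T, beta, lam_dd=0.0, mttr_years=0.0):
    return ((1 - beta) * (lam_du * T) ** 2 / 3   # independent failures
            + beta * lam_du * T / 2              # common cause failures
            + lam_dd * mttr_years)               # detected failures awaiting repair

def sil_band(pfd):
    # Low-demand SIL n corresponds to 10^-(n+1) <= PFD < 10^-n
    return max(0, min(4, -math.floor(math.log10(pfd)) - 1))

pfd = pfd_avg_1oo2(lam_du=0.01, T=1.0, beta=0.05)
print(f"PFD ≈ {pfd:.0e}, feasible SIL band: {sil_band(pfd)}")
```

With the illustrative targets λDU = 0.01 pa, β = 5% and T = 1 year, the common cause term still dominates and the subsystem sits near the bottom of the SIL 3 band.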
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
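One way to sketch this comparison of actual failures against a design benchmark is a simple Poisson check. The device counts and rates below are illustrative, and SciPy is assumed to be available:

```python
# Sketch: compare the observed failure count against the design benchmark.
# With n devices observed for t years at an assumed rate λDU, the expected
# count is λDU·n·t; a Poisson tail probability flags populations whose
# failure rate looks significantly worse than assumed.
from scipy.stats import poisson

lam_du = 0.02          # benchmark rate assumed in design, failures per device-year
devices, years = 40, 5
expected = lam_du * devices * years   # expected number of failures
observed = 9                          # failures actually recorded (illustrative)

# Probability of seeing 'observed' or more failures if the benchmark holds
p = poisson.sf(observed - 1, expected)
print(f"expected {expected:.1f}, observed {observed}, P(X >= {observed}) = {p:.3f}")
if p < 0.05:
    print("Failure rate likely exceeds the design benchmark: analyse causes")
```

A low tail probability does not by itself identify the cause; as the report suggests, it is the trigger for root cause analysis and compensating measures.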
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the 'random' failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016.
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007.
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010.
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010.
[5] 'Petroleum, petrochemical and natural gas industries — Reliability modelling and calculation of safety systems', ISO/TR 12489.
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015.
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013.
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013.
[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008.
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015.
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015.
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015.
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th edition, International Statistical Institute, Longman Scientific and Technical, 1990.
The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:
PFD ≈ λDU·t
With a proof test interval T, the average is simply PFDavg ≈ λDU·T/2.
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements. With 1oo2 valves the PFD is approximately
PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2
where β is the fraction of failures that share a common cause. Common cause failure always strongly dominates the PFD; the PFD depends equally on λDU, β and T.
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed, constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum.
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum.
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum.
These are order-of-magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12%–15% in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application
Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• calculate the actual λDU and compare with the design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
‒ preventing systematic failures
• volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and on failure rates being fixed and constant.
Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that the failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant.
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.
3. Using risk-based inspection and condition-based maintenance techniques to:
- identify and then control conditions that induce early failures
- actively prevent common cause failures
4. Detailed analysis of all failures to:
- monitor failure rates
- determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFDs of the individual components.
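A short sketch of this exponential accumulation, comparing 1 − e^(−λDU·t) with the linear approximation λDU·t that holds while λDU·t is small (the region in which SIFs are required to operate):

```python
# Sketch: probability that a component has failed undetected by time t,
# PFD(t) = 1 - exp(-λDU·t), versus the linear approximation λDU·t.
import math

lam_du = 0.02  # undetected dangerous failure rate, per annum (example value)
for t in (1, 5, 25):
    exact = 1 - math.exp(-lam_du * t)
    approx = lam_du * t
    print(f"t={t:2d} y: exact={exact:.4f}, approx={approx:.4f}")
```

At t = 1 year the two agree to within 1%; by t = 25 years (λDU·t = 0.5) the linear form overstates the accumulated probability noticeably, which is why the approximation is reserved for PFD well below 0.1.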
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
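As a sketch of how these bounds behave (not a procedure prescribed by the paper): with n failures observed over a cumulative time in service Ttot, the upper confidence bound at level a is commonly taken as λa = χ²(a; 2n + 2) / (2·Ttot). The Wilson-Hilferty approximation to the χ² quantile is accurate enough to reproduce the ratios quoted above; the standard-normal quantiles for 70% and 90% are hardcoded constants:

```python
def chi2_quantile(z: float, dof: int) -> float:
    """Wilson-Hilferty approximation to the chi-squared quantile.
    z is the standard-normal quantile for the desired probability."""
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + z * a ** 0.5) ** 3

Z = {0.70: 0.5244, 0.90: 1.2816}  # standard-normal quantiles for 70% and 90%

def lambda_upper(n_failures: int, t_total: float, conf: float) -> float:
    """Upper confidence bound on the failure rate after n_failures in t_total unit-hours."""
    return chi2_quantile(Z[conf], 2 * n_failures + 2) / (2.0 * t_total)

# The ratio of the 90% bound to the 70% bound depends only on the failure count:
for n in (3, 10):
    r = lambda_upper(n, 1.0, 0.90) / lambda_upper(n, 1.0, 0.70)
    print(n, round(r, 2))   # about 1.40 for n=3, about 1.24 for n=10
```

Note that t_total cancels in the ratio, which is why the width of the band depends only on the number of recorded failures, as the text states.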
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:
'…failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:
'The simplest example of a stationary process where in discrete time all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.
Clearly the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
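Plugging in the illustrative figures used above shows how strongly the common cause term dominates. This is a sketch only; the λDD and MTTR values are assumptions added for illustration:

```python
beta = 0.10           # common cause fraction (assumed)
lam_du = 0.02         # dangerous undetected failures per year
lam_dd = 0.05         # dangerous detected failures per year (assumed)
T = 1.0               # proof-test interval, years
mttr = 8.0 / 8760.0   # mean time to restoration: 8 hours, expressed in years

independent = (1 - beta) * (lam_du * T) ** 2 / 3   # independent 1oo2 term
common_cause = beta * lam_du * T / 2               # common cause term
detected = lam_dd * mttr                           # detected-failure term

print(f"independent   {independent:.1e}")    # 1.2e-04
print(f"common cause  {common_cause:.1e}")   # 1.0e-03
print(f"detected      {detected:.1e}")       # ~4.6e-05
# With beta = 10%, the common cause term is roughly 8x the independent term.
```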
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion of different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice.
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
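The order-of-magnitude nature of the result is exactly what SIL categorisation needs. A small sketch of the mapping from average PFD to SIL for a demand-mode function, using the band boundaries defined in IEC 61508/61511 (boundary values falling exactly on a power of ten should be checked against the standard's table rather than trusted to floating point):

```python
import math

def sil_from_pfd(pfd_avg: float) -> int:
    """Return the SIL band for a demand-mode function (0 if below SIL 1)."""
    if not 0.0 < pfd_avg < 1.0:
        raise ValueError("PFDavg must be between 0 and 1")
    rrf = 1.0 / pfd_avg                      # risk reduction factor
    sil = int(math.floor(math.log10(rrf)))   # order of magnitude of the RRF
    return min(max(sil, 0), 4)

print(sil_from_pfd(0.005))    # RRF = 200  -> 2 (SIL 2)
print(sil_from_pfd(0.0008))   # RRF = 1250 -> 3 (SIL 3)
```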
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and are all under the control of the designers and the operations and maintenance team.
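The feasibility argument can be sketched as a comparison of architectures against the SIL bands. The formulas are the simplified approximations used in this paper, and β, λDU and T are the assumed typical values:

```python
def pfd_1oo1(lam_du: float, T: float) -> float:
    """Average PFD of a single final element, simplified approximation."""
    return lam_du * T / 2

def pfd_1oo2(lam_du: float, T: float, beta: float) -> float:
    """Average PFD of a 1oo2 pair, including the common cause contribution."""
    return (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2

lam_du, T, beta = 0.02, 1.0, 0.10
print(f"1oo1: {pfd_1oo1(lam_du, T):.1e}")        # 1.0e-02 -> marginal for SIL 2
print(f"1oo2: {pfd_1oo2(lam_du, T, beta):.1e}")  # ~1.1e-03 -> just short of SIL 3
# Even with redundancy, beta = 10% leaves the PFD above the SIL 3 band (< 1e-3),
# so beta, lam_du and/or T must be reduced to reach RRF >= 1000.
```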
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
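One way to sketch the kind of benchmark check the SINTEF guideline suggests (the data and the 5% threshold are illustrative assumptions, not values from the guideline): under the design assumption the expected failure count over n devices for t years is Poisson with mean λ·n·t, so an observed count can be flagged when it would be improbably high under that assumption.

```python
import math

def poisson_sf(k: int, mean: float) -> float:
    """P(N >= k) for a Poisson random variable with the given mean."""
    p_below = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - p_below

def rate_exceeds_benchmark(observed: int, lam_design: float,
                           n_devices: int, years: float,
                           alpha: float = 0.05) -> bool:
    """Flag when the observed failure count is improbably high under the design rate."""
    mean = lam_design * n_devices * years
    return poisson_sf(observed, mean) < alpha

# 40 valves, design rate 0.02/yr, 5 years of operation -> 4 failures expected
print(rate_exceeds_benchmark(5, 0.02, 40, 5.0))    # False: within expectation
print(rate_exceeds_benchmark(10, 0.02, 40, 5.0))   # True: investigate causes
```

A flag here is a trigger for root cause analysis and compensating measures, not a verdict; the guideline's point is that exceedances must be analysed, not merely tested away.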
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable, if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety - Safety instrumented systems for the process industry sector - Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing - Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries - Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems - PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems - PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.
With 1oo2 valves the PFD is approximately:
PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2
β is the fraction of failures that share common cause.
Common cause failure always strongly dominates PFD.
The PFD depends equally on λDU, β and T.
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.
The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice.
Typically failure rates vary over at least an order of magnitude.
The scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets
(risk targets can only be in orders of magnitude)
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%
Typically β ≈ 12% to 15% in practice (Ref: SINTEF Report A26922)
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application
Batch process: T < 0.1 years
- results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
- low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y)
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design, and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• calculate actual λDU, compare with design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 ‘Route 2H’ method, which relies on increased confidence levels in failure rates.
Instead of requiring increased confidence levels, IEC 61511-1 requires ‘credible and traceable reliability data’ in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant.
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility’s maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes and throughout operation and maintenance practices
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the probability of failure on demand (PFD) for safety functions and the corresponding risk reduction factor (RRF) achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term ‘catalectic failure’ to clarify the types of failures that can be expected to occur at fixed rates:
3.29 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently “as good as new” until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components
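Under these assumptions the accumulation of undetected failures follows the exponential law. The short Python sketch below (with illustrative values, not figures prescribed by the standards) compares the exact time-average of the PFD over a proof-test interval with the familiar first-order approximation λDU·T/2:

```python
import math

def pfd_instantaneous(lam_du: float, t: float) -> float:
    """Probability that an undetected dangerous fault has accumulated by
    time t, for a constant failure rate lam_du (per year)."""
    return 1.0 - math.exp(-lam_du * t)

def pfd_average(lam_du: float, test_interval: float) -> float:
    """Exact time-average of PFD(t) over one proof-test interval (years)."""
    T = test_interval
    return 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)

# Illustrative values: lam_du = 0.02 failures pa, annual proof test
lam_du, T = 0.02, 1.0
exact = pfd_average(lam_du, T)     # ~0.00993
approx = lam_du * T / 2.0          # 0.01 - within 1% for small lam_du*T
```

For small λDU·T the two agree closely, which is why the simple λT/2 form is normally quoted for a single (1oo1) component.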
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason, the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected (‘Route 1H’)
• Increasing the confidence level in the measured failure rates to at least 90% (‘Route 2H’)
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level ‘a’ is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
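This estimate can be sketched in a few lines of Python. To keep the sketch self-contained, the χ² quantile is computed here with the Wilson–Hilferty approximation (exact quantiles, e.g. from scipy.stats.chi2.ppf, give very similar values); the 1000 device-years of operating time is a hypothetical figure:

```python
from statistics import NormalDist

def chi2_quantile(p: float, df: int) -> float:
    """Wilson-Hilferty approximation to the chi-squared quantile function."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3

def failure_rate(n_failures: int, device_years: float, confidence: float) -> float:
    """Failure rate estimated at the given confidence level from n failures
    observed over a cumulative operating time in device-years."""
    return chi2_quantile(confidence, 2 * n_failures + 2) / (2.0 * device_years)

# The ratio lambda_90 / lambda_70 depends only on the number of failures;
# the uncertainty band narrows as more failures are recorded.
r3 = failure_rate(3, 1000.0, 0.90) / failure_rate(3, 1000.0, 0.70)    # ~1.40
r10 = failure_rate(10, 1000.0, 0.90) / failure_rate(10, 1000.0, 0.70)  # ~1.24
```

Note that the ratios computed for any fixed operating time are unchanged if the operating time is varied, confirming that the width of the band depends only on the failure count.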
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word ‘random’:
‘Made, done, or happening without method or conscious decision; haphazard’
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
‘…failure occurring at a random time which results from one or more of the possible degradation mechanisms in the hardware’
The mathematical analysis of failure probability is based on the concept of a ‘random process’ or ‘stochastic process’. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
‘failure rates arising from random hardware failures can be predicted with reasonable accuracy’
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from ‘A Dictionary of Statistical Terms’ by F.H.C. Marriott [13]:
‘The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as “white noise”, relating to the energy characteristics of certain physical phenomena’
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
‘…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot’
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
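Plugging representative numbers into the 1oo2 approximation makes the dominance of the common cause term concrete (a sketch; β = 10% is a typical figure, λDU and T are the example values above):

```python
def pfd_1oo2(beta, lam_du, T):
    """Average PFD of a 1oo2 final-element pair, split into the
    independent-failure term and the common cause (beta) term."""
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    return independent, common_cause

ind, ccf = pfd_1oo2(beta=0.10, lam_du=0.02, T=1.0)
# ind = 1.2e-4 and ccf = 1.0e-3: common cause failures dominate the PFD.
total = ind + ccf  # ~1.1e-3, i.e. an RRF of roughly 900 - short of SIL 3
```

Halving β here would roughly halve the total PFD, whereas halving λDU alone would not, which is why β deserves the design attention discussed above.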
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
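For illustration, with hypothetical values (λDD = 0.2 pa and an MTTR of 8 hours; these are not figures from the references), the detected-failure contribution can be sketched as:

```python
HOURS_PER_YEAR = 8760.0

def pfd_detected(lam_dd_per_year, mttr_hours):
    """Average PFD contribution of detected dangerous failures: the
    fraction of time the function is down awaiting restoration."""
    return lam_dd_per_year * (mttr_hours / HOURS_PER_YEAR)

# lam_dd = 0.2 pa, MTTR = 8 h gives a contribution of ~1.8e-4.
# A tenfold longer MTTR (e.g. waiting for a shutdown window) gives
# ~1.8e-3, enough to dominate a SIL 2 budget on its own.
```

This is why the MTTR targets discussed later must be monitored as strictly as the failure rates themselves.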
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
‘Essentially, all models are wrong, but some are useful’
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant: the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice.
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
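The order-of-magnitude categorisation can be expressed directly. The sketch below uses the standard IEC 61511 low-demand SIL bands (SIL n: 10⁻⁽ⁿ⁺¹⁾ ≤ PFDavg < 10⁻ⁿ) and rounds a calculated PFD to a single significant figure for honest reporting:

```python
import math

def sil_band(pfd_avg):
    """Feasible SIL category for a low-demand safety function:
    SIL n corresponds to 10**-(n+1) <= PFDavg < 10**-n."""
    if not 1e-5 <= pfd_avg < 1e-1:
        return None  # outside the SIL 1-4 low-demand bands
    return -math.floor(math.log10(pfd_avg)) - 1

def one_sig_fig(x):
    """Round a PFD to one significant figure of precision."""
    exponent = math.floor(math.log10(abs(x)))
    return round(x, -exponent)

# A calculated PFD of 1.12e-3 is honestly reported as 1e-3,
# i.e. the SIL 2 band, with an RRF of roughly 1000 at best.
```

Reporting 1.12e-3 rather than 1e-3 would imply three significant figures of accuracy that the underlying failure-rate data cannot support.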
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, ‘Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase’ [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing ‘random’ failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources is limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2 The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics accessibility inspection testing maintenance and renewal
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction The requirements also depend on the cost of downtime To achieve SIL 3 safety functions will always need to be designed to enable ready access for inspection testing and maintenance
The planning for inspection and testing should be in proportion to the target level of risk reduction Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment
3 Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions Common cause failures dominate the PFD in all voted architectures of sensors and of final elements
4 In the operations phase measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design inspection testing maintenance and renewal The measured failure rates should be compared with the rates assumed in the PFD calculations
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future
If the measured failure rates are higher than the target benchmark then the reasons need to be understood and remedial action taken Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
PFD for 1oo2 final elements
In the process sector SIF PFD is dominated by the PFD of final elements. With 1oo2 valves the PFD is approximately:
PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause. Common cause failure always strongly dominates PFD. The PFD depends equally on λDU, β and T.
Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failure is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be in orders of magnitude).
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%. Typically β ≈ 12% to 15% in practice (ref. SINTEF Report A26922).
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application:
Batch process: T < 0.1 years
– results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
– low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
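The feasibility claims above can be checked with a short sketch, using the typical values assumed on this slide (the helper names are illustrative):

```python
# Quick feasibility check (assumed typical values from the text:
# beta = 10%, lambda_DU = 0.02 per annum, T = 1 year).

def pfd_1oo1(lam_du, t):
    """Average PFD of a single final element: lambda_DU * T / 2."""
    return lam_du * t / 2.0

def pfd_1oo2(lam_du, t, beta):
    """Average PFD of a 1oo2 pair, including the common cause (beta) term."""
    return (1.0 - beta) * (lam_du * t) ** 2 / 3.0 + beta * lam_du * t / 2.0

beta, lam_du, t = 0.10, 0.02, 1.0
for name, pfd in (("1oo1", pfd_1oo1(lam_du, t)),
                  ("1oo2", pfd_1oo2(lam_du, t, beta))):
    print(f"{name}: PFD ≈ {pfd:.1e}, RRF ≈ {1.0 / pfd:.0f}")
```

With these values, 1oo1 gives PFD ≈ 0.01 (RRF ≈ 100, the SIL 1/SIL 2 boundary), while 1oo2 gives PFD ≈ 1.1 × 10⁻³ (RRF ≈ 900), which is why SIL 3 additionally requires reducing λDU, β and/or T.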
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design, and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• calculate actual λDU, compare with design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
– preventing systematic failures
• volume of operating experience provides evidence that:
– systematic capability is adequate
– inspection and maintenance is effective
– failures are monitored, analysed and understood
– failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates It draws conclusions on how failure rates and failure probability can be controlled in practice
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions This is important in setting operational reliability benchmarks
Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.
3. Using risk-based inspection and condition-based maintenance techniques to:
- identify and then control conditions that induce early failures
- actively prevent common-cause failures
4. Detailed analysis of all failures to:
- monitor failure rates
- determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components
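This combination can be illustrated with a short sketch. The sensor and valve rates below are typical values discussed in this paper; the logic solver rate of 0.0005 pa is an assumed placeholder:

```python
import math

# Sketch (illustrative): undetected failures accumulate exponentially at a
# fixed rate, and independent component PFDs combine in series.

def pfd_at(lam_du, t):
    """Probability that one component has failed undetected by time t
    after the last proof test; approximately lam_du * t when small."""
    return 1.0 - math.exp(-lam_du * t)

def pfd_series(pfds):
    """System fails if ANY independent component has failed:
    1 minus the product of the individual survival probabilities."""
    survive = 1.0
    for p in pfds:
        survive *= 1.0 - p
    return 1.0 - survive

# Assumed rates (per annum): sensor 0.003, logic solver 0.0005, valve 0.02.
parts = [pfd_at(lam, 1.0) for lam in (0.003, 0.0005, 0.02)]
print(f"system PFD one year after test ≈ {pfd_series(parts):.4f}")
```

For small individual PFDs the series result is close to the simple sum of the component PFDs, which is the usual hand approximation.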
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself, or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
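This can be sketched with the standard library alone. The sketch assumes the common time-terminated convention λa = χ²(a, 2n + 2) / (2T), and approximates the χ² quantile with the Wilson-Hilferty formula; the function names are illustrative:

```python
from statistics import NormalDist

# Sketch: one-sided upper confidence bounds on a failure rate from n observed
# failures over a cumulative operating time T (device-years), using the
# chi-squared relation  lambda_a = chi2_quantile(a, 2n + 2) / (2 * T).

def chi2_quantile(p, k):
    """Wilson-Hilferty approximation to the chi-squared p-quantile, k df."""
    z = NormalDist().inv_cdf(p)
    return k * (1.0 - 2.0 / (9.0 * k) + z * (2.0 / (9.0 * k)) ** 0.5) ** 3

def rate_upper_bound(n_failures, t_total, conf):
    """Failure rate estimated with confidence level 'conf' (lambda_a)."""
    return chi2_quantile(conf, 2 * n_failures + 2) / (2.0 * t_total)

# Example: 10 failures over 500 accumulated device-years.
l70 = rate_upper_bound(10, 500.0, 0.70)
l90 = rate_upper_bound(10, 500.0, 0.90)
print(f"lambda70 ≈ {l70:.4f}/yr, lambda90 ≈ {l90:.4f}/yr, ratio ≈ {l90/l70:.2f}")
```

The ratio λ90/λ70 comes out around 1.2 for 10 failures and grows as the number of recorded failures falls, consistent with the narrowing uncertainty band described above; it is independent of the operating time.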
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent, if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process: degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor, β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events These failures are largely systematic and preventable
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
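Evaluating the two terms separately, with the example values from the text, makes the dominance visible (function name is illustrative):

```python
# Sketch: relative size of the independent and common cause terms in the
# 1oo2 PFD approximation, using the example values from the text.

def pfd_terms(lam_du, t, beta):
    """Return (independent term, common cause term) of the 1oo2 PFD."""
    independent = (1.0 - beta) * (lam_du * t) ** 2 / 3.0
    common_cause = beta * lam_du * t / 2.0
    return independent, common_cause

lam_du, t = 0.02, 1.0
for beta in (0.10, 0.02, 0.005):
    ind, ccf = pfd_terms(lam_du, t, beta)
    print(f"beta = {beta:.1%}: independent {ind:.1e}, common cause {ccf:.1e}")
```

At β = 10% the common cause term is nearly ten times the independent term; only for β well below 2% does the independent term take over.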
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)
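As a one-line illustration (the detected-failure rate of 0.05 pa and the 72 h MTTR below are assumed example values, not figures from the paper):

```python
# Sketch: PFD contribution of detected dangerous failures, approximately
# lambda_DD * MTTR. Values below are illustrative assumptions.

HOURS_PER_YEAR = 8760

def pfd_detected(lam_dd_per_year, mttr_hours):
    """Average unavailability while detected failures await restoration."""
    return lam_dd_per_year * mttr_hours / HOURS_PER_YEAR

print(f"{pfd_detected(0.05, 72):.1e}")  # 0.05 /yr detected, 72 h to restore
```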
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E. P. Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
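The order-of-magnitude result maps directly onto a SIL band; a minimal sketch using the IEC 61511 low-demand PFD bands (the function name is illustrative):

```python
# Sketch: an order-of-magnitude PFD estimate is enough to place a safety
# function in a SIL band (IEC 61511 low-demand mode PFD bands).

def sil_band(pfd_avg):
    """Map an average PFD to its low-demand SIL band."""
    if 1e-2 <= pfd_avg < 1e-1:
        return "SIL 1"
    if 1e-3 <= pfd_avg < 1e-2:
        return "SIL 2"
    if 1e-4 <= pfd_avg < 1e-3:
        return "SIL 3"
    if 1e-5 <= pfd_avg < 1e-4:
        return "SIL 4"
    return "outside the SIL 1-4 bands"

print(sil_band(1.1e-3))   # e.g. the 1oo2 valve example -> "SIL 2"
```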
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants: they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
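The SINTEF approach of comparing observed failures against a design-based target can be sketched as follows. The population size and failure counts below are hypothetical, for illustration only.

```python
# Target number of failures derived from the design failure-rate assumption,
# following the SINTEF A8788 follow-up idea. Numbers below are hypothetical.

def expected_failures(lam_du, n_devices, years):
    """Expected dangerous undetected failures across the device population."""
    return lam_du * n_devices * years

target = expected_failures(0.02, 40, 2.0)  # 40 valves, 2 years -> 1.6 expected
observed = 5
if observed > target:
    print("Failure rate above benchmark: analyse causes and consider "
          "compensating measures, not just more frequent testing")
```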
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
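The proportionality to out-of-service time can be illustrated directly. The hours used below are hypothetical, chosen only to show the scale of the effect.

```python
# The average unavailability contributed by bypass/repair time is simply the
# fraction of operating time the function is out of service. Hours are
# hypothetical, for illustration.
HOURS_PER_YEAR = 8760.0

def bypass_unavailability(hours_out_per_year):
    return hours_out_per_year / HOURS_PER_YEAR

# 48 hours out of service per year already contributes ~0.0055,
# a large share of a SIL 2 budget of PFD < 0.01.
print(bypass_unavailability(48.0))
```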
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied, with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the 'random' failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions: Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude; the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail, and the effectiveness of the processes, techniques and measures, must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Dependence on failure rate

Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed, constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.
Managing and reducing failures

OREDA reports failure rates that are feasible in practice. Typically, failure rates vary over at least an order of magnitude, so the scope for reducing failures is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU

Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum.
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum.
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum.

These are order-of-magnitude estimates, but they are sufficient to show feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).
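The figures above follow from the constant-rate approximation λDU ≈ 1/MTBF. A minimal sketch of that arithmetic, using the slide's typical values:

```python
# Order-of-magnitude conversion between MTBF and failure rate, assuming a
# constant rate. Values are the typical figures quoted on the slide.

def lam_from_mtbf(mtbf_years):
    return 1.0 / mtbf_years  # failures per annum

for device, mtbf in [("sensor", 300), ("actuated valve", 50), ("relay", 200)]:
    print(f"{device}: lam_DU ~ {lam_from_mtbf(mtbf):.4f} per annum")
```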
Typical feasible values for β

IEC 61508-6 Annex D suggests a range of 1% to 10%. Typically β ≈ 12%-15% in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on the application:
• Batch process: T < 0.1 years ‒ results in low PFD; final elements might not dominate
• Continuous process: T ≈ 1 year
• LNG production: T > 4 years ‒ low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.
What architecture is typically needed?

For typical values β ≈ 10%, λDU ≈ 0.02 p.a., T ≈ 1 y:
• SIL 1 is always easy to achieve without HFT.
• SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 p.a. (MTBFDU > 50 y).
• SIL 3 would need 1oo2 final elements and overall λDU < 0.02 p.a.:
‒ either λDU, β and/or T must be minimised
‒ needs close attention in design
Design assumptions set performance benchmarks

Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance. Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential

Operators must monitor the failure rates and modes:
• Calculate actual λDU and compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components

Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance are effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland #18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:
- identify and then control conditions that induce early failures
- actively prevent common-cause failures

4. Detailed analysis of all failures to:
- monitor failure rates
- determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
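This exponential accumulation is standard reliability arithmetic; a small sketch (our helper names, with the paper's typical values) shows how averaging 1 − e^(−λt) over the proof-test interval recovers the familiar PFDavg ≈ λDU·T/2:

```python
import math

# Probability that an undetected dangerous fault is present at time t after a
# proof test, for a constant rate lam_du (exponential model).
def pfd_at(lam_du, t):
    return 1.0 - math.exp(-lam_du * t)

# Numerical average over the proof-test interval T.
def pfd_avg(lam_du, T, steps=10000):
    return sum(pfd_at(lam_du, i * T / steps) for i in range(steps)) / steps

lam, T = 0.02, 1.0
print(pfd_avg(lam, T))   # ~0.0099, close to lam * T / 2 = 0.01
```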
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself, or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
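A sketch of this calculation: with n failures observed over a total operating time Ttotal, the rate at confidence a is λa = χ²(a, 2n+2) / (2·Ttotal). The helper names are ours, and the χ² quantile uses the Wilson-Hilferty approximation so that only the Python standard library is needed.

```python
from statistics import NormalDist

# Approximate chi-squared quantile via the Wilson-Hilferty transformation.
def chi2_ppf(p, k):
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * k)
    return k * (1.0 - c + z * c ** 0.5) ** 3

# Confidence bound on the failure rate: n failures in t_total unit-years.
def lam_conf(n_failures, t_total, confidence):
    return chi2_ppf(confidence, 2 * n_failures + 2) / (2.0 * t_total)

# With 10 recorded failures the ratio lam90/lam70 is ~1.2, independent of the
# failure rate and population size, as stated above.
ratio = lam_conf(10, 1000.0, 0.90) / lam_conf(10, 1000.0, 0.70)
print(round(ratio, 2))
```

With fewer recorded failures the band widens: for 3 failures the same ratio is roughly 1.4.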
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard.'

ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy.'

Failures that occur at a fixed, constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented, to some extent, if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process; degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)(λDU T)²/3 + β λDU T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
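The dominance of the common cause term can be checked numerically. The following Python sketch (the function name is an illustrative assumption, not from the paper) evaluates both terms of the 1oo2 approximation with the example values β = 10%, λDU = 0.02 pa and T = 1 year:

```python
# Sketch: the two terms of the 1oo2 average-PFD approximation
# PFDavg ~= (1-beta)*(lam*T)**2/3 + beta*lam*T/2

def pfd_1oo2(beta, lam_du, t_years):
    """Return (independent term, common cause term) of the 1oo2 PFD approximation."""
    independent = (1 - beta) * (lam_du * t_years) ** 2 / 3
    common_cause = beta * lam_du * t_years / 2
    return independent, common_cause

ind, ccf = pfd_1oo2(beta=0.10, lam_du=0.02, t_years=1.0)
print(f"independent term : {ind:.1e}")   # 1.2e-04
print(f"common cause term: {ccf:.1e}")   # 1.0e-03 -> dominates the PFD
```

With these typical values the common cause term is roughly eight times larger than the independent term, which is the point made above.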
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful.'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice.
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
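The order-of-magnitude categorisation can be made concrete with a small sketch. The snippet below (function name is an illustrative assumption) maps an average PFD onto the IEC 61511 demand-mode SIL bands, where SIL n corresponds to 10^-(n+1) ≤ PFDavg < 10^-n, and derives the risk reduction factor from the 1oo2 example above:

```python
def sil_band(pfd_avg):
    """IEC 61511 demand-mode bands: SIL n for 10**-(n+1) <= PFDavg < 10**-n."""
    for sil, lower in ((4, 1e-5), (3, 1e-4), (2, 1e-3), (1, 1e-2)):
        if lower <= pfd_avg < lower * 10:
            return sil
    return None  # outside the SIL 1..4 demand-mode bands

rrf = 1 / 1.12e-3                     # risk reduction factor from the 1oo2 example
print(round(rrf), sil_band(1.12e-3))  # 893 2 -> roughly RRF 900, SIL 2 band
```

Reporting the result as "SIL 2, RRF of order 10³" is consistent with the one-significant-figure limit argued above.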
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
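The marginality of SIL 2 with a 1oo1 valve can be illustrated with the usual single-channel approximation PFDavg ≈ λDU·T/2 (the helper function is an illustrative assumption, not from the paper):

```python
# Sketch: single-channel (1oo1) average PFD approximation, PFDavg ~= lam*T/2

def pfd_1oo1(lam_du, t_years):
    """Average PFD of a 1oo1 final element with proof-test interval T."""
    return lam_du * t_years / 2

print(pfd_1oo1(0.02, 1.0))   # 0.01  -> RRF = 100, the bottom edge of SIL 2
print(pfd_1oo1(0.02, 0.5))   # 0.005 -> RRF = 200, a more comfortable SIL 2 margin
```

With the typical valve rate of 0.02 pa and a yearly proof test, the PFD sits exactly at the SIL 2 boundary, which is why the text calls a 1oo1 valve marginal; halving the proof-test interval restores some margin.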
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
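One way to implement such a target comparison is sketched below. The function names, the population size and the observation period are illustrative assumptions: the expected number of failures follows from the design failure rate, and a simple Poisson test flags an observed count that is unlikely to be chance variation:

```python
import math

def expected_failures(lam_du_per_year, n_devices, years):
    """Target (expected) failure count from the failure rate assumed in design."""
    return lam_du_per_year * n_devices * years

def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson-distributed failure count."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))

mean = expected_failures(0.02, 50, 4.0)   # 50 valves observed for 4 years -> 4 expected
observed = 9
# Flag if the observed count would be reached by chance less than 5% of the time
excessive = 1 - poisson_cdf(observed - 1, mean) < 0.05
print(mean, excessive)                    # 4.0 True -> analyse causes, consider measures
```

A flagged result would trigger exactly the follow-up that the SINTEF guideline describes: root cause analysis and compensating measures, not merely more frequent testing.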
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failure is clear: reduction in rate by a factor of 10 may be feasible.
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets
(risk targets can only be in orders of magnitude)
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%
Typically β ≈ 12% - 15% in practice (Ref SINTEF Report A26922)
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application
Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ Low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y)
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR
The values assumed in design set performance standards for availability and reliability in operation and maintenance
Achieving these performance standards depends on the design and on operation and maintenance practices
Monitoring performance is essential
Operators must monitor the failure rates and modes
• Calculate actual λDU, compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and on failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, the operating environment and the maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant.
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFDs of the individual components.
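The exponential accumulation can be verified numerically. The sketch below (illustrative, not from the paper) time-averages the accumulating unavailability 1 − e^(−λDU·t) of a single component over one proof-test interval, and recovers the familiar PFDavg ≈ λDU·T/2 approximation when λDU·T ≪ 1:

```python
import math

lam_du, T = 0.02, 1.0    # per-annum undetected dangerous failure rate, yearly proof test
n = 10_000               # midpoint-rule resolution for the numerical time-average

# Numerical average of the unavailability 1 - exp(-lam*t) over one test interval
avg = sum(1 - math.exp(-lam_du * (i + 0.5) * T / n) for i in range(n)) / n
print(f"{avg:.6f}")      # 0.009934, close to the lam_du*T/2 = 0.01 approximation
```

The small gap between 0.009934 and 0.01 is the higher-order term discarded by the linear approximation; it is negligible compared with the order-of-magnitude uncertainty in λDU itself.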
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
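This behaviour can be checked with a short calculation. The sketch below uses the standard upper one-sided confidence bound λa = χ²a(2n+2)/(2T) for n failures in cumulative operating time T, with the Wilson-Hilferty approximation to the χ² quantile, so the figures are approximate:

```python
# The lambda90/lambda70 ratio cancels the operating time T,
# so it depends only on the number of recorded failures n.
Z = {0.70: 0.5244, 0.90: 1.2816}   # standard normal quantiles

def chi2_quantile(p, k):
    """Wilson-Hilferty approximation to the chi-squared p-quantile, k dof."""
    c = 2.0 / (9.0 * k)
    return k * (1.0 - c + Z[p] * c ** 0.5) ** 3

def ratio_90_to_70(n_failures):
    dof = 2 * n_failures + 2
    return chi2_quantile(0.90, dof) / chi2_quantile(0.70, dof)

print(ratio_90_to_70(3))    # ≈ 1.40: still a wide band with only 3 failures
print(ratio_90_to_70(10))   # ≈ 1.24: close to the ~20% figure above
```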
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)(λDU T)²/3 + β λDU T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
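The relative size of the two terms is easy to verify numerically. The sketch below uses the example values already quoted; the β values are illustrative:

```python
def pfd_terms(lam_du, T, beta):
    """Independent and common cause contributions to the 1oo2 average PFD:
    (1-beta)*(lam*T)^2/3 and beta*lam*T/2 respectively."""
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    return independent, common_cause

lam_du, T = 0.02, 1.0                  # failures p.a., test interval in years
for beta in (0.01, 0.02, 0.10):
    ind, ccf = pfd_terms(lam_du, T, beta)
    print(f"beta={beta:.0%}: independent={ind:.2e}, common cause={ccf:.2e}")
```

At β = 10% the common cause term is nearly ten times the independent term; even at β = 2% it is already the larger of the two.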
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
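This proportionality for detected failures can be sketched directly; the rate and repair time below are illustrative assumptions, not figures from the references:

```python
HOURS_PER_YEAR = 8760.0

def pfd_detected(lam_dd_per_year, mttr_hours):
    """Average unavailability due to detected failures awaiting repair:
    lambda_DD * MTTR, with both expressed in consistent units."""
    return lam_dd_per_year * (mttr_hours / HOURS_PER_YEAR)

# e.g. 0.1 detected failures p.a. and a 48-hour mean time to restoration
print(pfd_detected(0.1, 48.0))   # ≈ 5.5e-4
```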
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture) using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.
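These feasibility arguments can be checked with the simplified PFD formulas quoted earlier; β = 10% and β = 5% are illustrative figures consistent with the SINTEF ranges cited in this paper:

```python
def pfd_1oo1(lam_du, T):
    """Average PFD of a single final element."""
    return lam_du * T / 2.0

def pfd_1oo2(lam_du, T, beta):
    """Average PFD of a 1oo2 pair, including common cause failures."""
    return (1.0 - beta) * (lam_du * T) ** 2 / 3.0 + beta * lam_du * T / 2.0

lam_du, T = 0.02, 1.0
print(pfd_1oo1(lam_du, T))         # 0.01: RRF 100, right at the SIL 1/2 border
print(pfd_1oo2(lam_du, T, 0.10))   # ≈ 1.1e-3: misses the SIL 3 band (PFD < 1e-3)
print(pfd_1oo2(lam_du, T, 0.05))   # ≈ 6.3e-4: reducing beta just reaches SIL 3
```

With typical values, 1oo2 valves and β = 10% fall just short of an RRF of 1,000; only by driving β down (or reducing λDU or T) does SIL 3 become feasible, which is the point made above.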
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
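A minimal sketch of this follow-up idea, with illustrative numbers and a deliberately crude screening threshold (the threshold is our assumption; A8788 describes more formal criteria):

```python
def expected_failures(lam_target, n_devices, years):
    """Benchmark failure count implied by the design failure rate."""
    return lam_target * n_devices * years

def needs_follow_up(observed, lam_target, n_devices, years, factor=2.0):
    """Flag for root cause analysis when the observed count exceeds the
    benchmark by more than `factor` (an assumed screening threshold)."""
    return observed > factor * expected_failures(lam_target, n_devices, years)

# 50 similar valves, 4 years in service, design assumption 0.02 failures p.a.
print(expected_failures(0.02, 50, 4))     # benchmark: 4 failures
print(needs_follow_up(10, 0.02, 50, 4))   # 10 observed -> True, investigate
```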
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates for some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing lsquorandomrsquo failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order-of-magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets
(risk targets can only be set in orders of magnitude)
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%
Typically β ≈ 12% – 15% in practice (Ref SINTEF Report A26922)
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application:
Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT
SIL 2 needs 1oo2 final elements unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y)
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR
The values assumed in design set performance standards for availability and reliability in operation and maintenance
Achieving these performance standards depends on the design and on operation and maintenance practices
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• calculate actual λDU and compare with the design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
‒ preventing systematic failures
• volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager, and A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and on failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components
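The exponential accumulation and the series combination of component PFDs can be sketched in a few lines. This is an illustrative sketch only: the function names and the example rates are mine, not taken from any standard.

```python
import math

def pfd_at_time(lambda_du: float, t: float) -> float:
    """Probability that a component has failed dangerously undetected
    by time t, assuming a fixed, constant failure rate (exponential law)."""
    return 1.0 - math.exp(-lambda_du * t)

def pfd_avg_1oo1(lambda_du: float, test_interval: float) -> float:
    """Time-average PFD over a proof-test interval T; for small lambda*T
    this approaches the familiar lambda*T/2 approximation."""
    lt = lambda_du * test_interval
    return 1.0 - (1.0 - math.exp(-lt)) / lt

# Example: lambda_DU = 0.02 per annum, proof test every year
pfd_valve = pfd_avg_1oo1(0.02, 1.0)   # close to lambda*T/2 = 0.01

# For a series system of independent components, small PFDs simply add
pfd_loop = pfd_avg_1oo1(0.005, 1.0) + pfd_avg_1oo1(0.001, 1.0) + pfd_valve
```

The order-of-magnitude character of the result is already visible here: the answer is dominated by the largest component PFD, not by the precision of the arithmetic.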
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason, the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
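This behaviour can be illustrated with the standard χ² estimator λa = χ²a(2n + 2) / (2·Ttotal), for n failures recorded over a cumulative operating time Ttotal. The sketch below is an assumption-laden illustration: it substitutes the Wilson-Hilferty approximation for the exact χ² quantile so that it needs only the standard library (an exact routine such as scipy.stats.chi2.ppf would normally be used), and the example figures are invented.

```python
from statistics import NormalDist

def chi2_quantile(p: float, df: int) -> float:
    """Wilson-Hilferty approximation to the chi-squared quantile
    (a stand-in here for an exact routine such as scipy.stats.chi2.ppf)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3

def failure_rate(n_failures: int, total_time: float, confidence: float) -> float:
    """Failure rate estimated at the given confidence level, from
    n failures over the cumulative operating time of the population."""
    return chi2_quantile(confidence, 2 * n_failures + 2) / (2.0 * total_time)

# Example: 10 failures over 500 device-years of cumulative operation
lam70 = failure_rate(10, 500.0, 0.70)
lam90 = failure_rate(10, 500.0, 0.90)
ratio_10 = lam90 / lam70   # depends only on the failure count
ratio_3 = failure_rate(3, 500.0, 0.90) / failure_rate(3, 500.0, 0.70)
```

Running this shows the narrowing described above: with 3 recorded failures λ90 sits roughly 40% above λ70, while with 10 failures the gap shrinks to the order of 20%.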
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector, the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.
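Evaluating the two terms of the 1oo2 approximation with the example figures makes the dominance concrete. This is an illustrative sketch using β = 10%; the function name is mine.

```python
def pfd_avg_1oo2(lambda_du: float, test_interval: float, beta: float):
    """The two terms of the 1oo2 PFD approximation: independent
    double failures, and common cause failures."""
    independent = (1.0 - beta) * (lambda_du * test_interval) ** 2 / 3.0
    common_cause = beta * lambda_du * test_interval / 2.0
    return independent, common_cause

# lambda_DU = 0.02 pa, annual proof test, beta = 10%
ind, ccf = pfd_avg_1oo2(0.02, 1.0, 0.10)
# ind = 1.2e-4 and ccf = 1.0e-3: the common cause term
# contributes roughly 90% of the subsystem PFD
```

With these figures the redundancy buys much less than the architecture suggests: reducing β is worth far more than adding further voting channels.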
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
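That categorisation amounts to reading off the decade of the PFD estimate. A minimal sketch, using the demand-mode SIL bands of IEC 61511 (SIL 1 for 10⁻² ≤ PFD < 10⁻¹, SIL 2 for 10⁻³ ≤ PFD < 10⁻², and so on); the helper name is mine:

```python
import math

def sil_band(pfd_avg: float) -> int:
    """Demand-mode SIL band for an average PFD per IEC 61511:
    SIL 1: 1e-2 <= PFD < 1e-1, SIL 2: 1e-3 <= PFD < 1e-2, etc."""
    if not 1e-5 <= pfd_avg < 1e-1:
        raise ValueError("PFD outside the SIL 1 to SIL 4 bands")
    # The band is simply the decade of the PFD estimate
    return -int(math.floor(math.log10(pfd_avg))) - 1
```

An order-of-magnitude estimate is exactly enough input for this function, which is the point: for example, PFD estimates of 0.05, 0.005 and 0.0005 map to SIL 1, 2 and 3 respectively.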
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
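That follow-up comparison can be sketched in a few lines. The Poisson acceptance test and the 5% threshold are my assumptions for illustration, not something prescribed by the SINTEF guideline; the example figures are invented.

```python
import math

def expected_failures(lambda_target: float, n_devices: int, years: float) -> float:
    """Number of failures expected if the population meets the benchmark rate."""
    return lambda_target * n_devices * years

def exceeds_benchmark(observed: int, expected: float, p_limit: float = 0.05) -> bool:
    """Flag when the observed failure count is improbably high under the
    benchmark, using the Poisson tail probability P(N >= observed)."""
    tail = 1.0 - sum(math.exp(-expected) * expected ** k / math.factorial(k)
                     for k in range(observed))
    return tail < p_limit

# Example: 50 valves, benchmark lambda_DU = 0.02 pa, reviewed after 4 years
mu = expected_failures(0.02, 50, 4.0)     # 4 failures expected
investigate = exceeds_benchmark(10, mu)   # 10 observed: analyse the causes
```

A statistical trigger of this kind only flags the discrepancy; the guideline's point stands that the causes must then be analysed and compensating measures considered.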
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It also includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4 In the operations phase measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design inspection testing maintenance and renewal The measured failure rates should be compared with the rates assumed in the PFD calculations
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future
If the measured failure rates are higher than the target benchmark then the reasons need to be understood and remedial action taken Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA, Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F. H. C. Marriott, 'A Dictionary of Statistical Terms', 5th edition, International Statistical Institute, Longman Scientific and Technical, 1990
Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12%–15% in practice (Ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures
Typical values for T depend on application
Batch process: T < 0.1 years
– results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
– low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• calculate actual λDU and compare with the design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
– preventing systematic failures
• volume of operating experience provides evidence that:
– systematic capability is adequate
– inspection and maintenance is effective
– failures are monitored, analysed and understood
– failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and on failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• failures are almost never purely random, and as a result
• failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:
- identify, and then control, conditions that induce early failures
- actively prevent common-cause failures.

4. Analysing all failures in detail to:
- monitor failure rates
- determine how failure recurrence may be prevented, if failure rates need to be reduced.
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
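Under these assumptions the averaging works out simply. As an illustrative sketch (the values are hypothetical, not from the paper), the time-averaged PFD of a single 1oo1 component over a proof-test interval T can be computed both exactly and with the familiar first-order approximation λDU·T/2:

```python
import math

def pfd_1oo1_exact(lam_du: float, T: float) -> float:
    # Time-average of 1 - exp(-lam*t) over one proof-test interval T
    return 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)

def pfd_1oo1_approx(lam_du: float, T: float) -> float:
    # First-order approximation, valid while lam_du * T << 1
    return lam_du * T / 2.0

# Hypothetical values: 0.02 dangerous undetected failures p.a., 1-year test interval
lam_du, T = 0.02, 1.0
print(pfd_1oo1_approx(lam_du, T))  # 0.01
print(pfd_1oo1_exact(lam_du, T))   # ~0.00993
```

For λDU·T ≪ 1 the two results agree closely, which is why the linear approximation is used throughout the simplified equations in the standards.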
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason, the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• increasing the confidence level in the measured failure rates to at least 90% ('Route 2H').

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
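A rough sketch of how such bounds behave (illustrative only; the Wilson–Hilferty approximation to the χ² quantile is used here to keep the example dependency-free): the one-sided upper bound on λ from n failures observed over a cumulative operating time τ is λa = χ²(a; 2(n+1)) / (2τ).

```python
import math
from statistics import NormalDist

def chi2_quantile(p: float, k: float) -> float:
    # Wilson-Hilferty approximation to the chi-squared quantile function
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * k)
    return k * (1.0 - a + z * math.sqrt(a)) ** 3

def lambda_upper(n_failures: int, cumulative_time: float, conf: float) -> float:
    # One-sided upper confidence bound on the failure rate, given
    # n_failures observed over a cumulative operating time
    return chi2_quantile(conf, 2 * (n_failures + 1)) / (2.0 * cumulative_time)

# The ratio lambda_90 / lambda_70 depends only on the number of failures,
# not on the rate itself or on the population size:
for n in (3, 10):
    ratio = lambda_upper(n, 1e6, 0.90) / lambda_upper(n, 1e6, 0.70)
    print(n, round(ratio, 2))
```

With 3 recorded failures the λ90/λ70 ratio comes out around 1.4, and with 10 failures it drops to roughly 1.2, consistent with the observation above.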
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed, constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F. H. C. Marriott [13]:

'The simplest example of a stationary process, where in discrete time all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process: degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are then characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause, which will affect all of the devices at about the same time.

Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector, the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8], [11], [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β) · (λDU · T)² / 3 + β · λDU · T / 2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU · T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
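Plugging the example figures into the approximation above shows how strongly the common cause term dominates (a sketch using the illustrative values β = 10%, λDU = 0.02 pa, T = 1 year):

```python
# Illustrative values from the 1oo2 valve example above
beta, lam_du, T = 0.10, 0.02, 1.0

independent = (1 - beta) * (lam_du * T) ** 2 / 3   # ~1.2e-4
common_cause = beta * lam_du * T / 2               # 1.0e-3

pfd = independent + common_cause
print(pfd)                 # ~1.1e-3, i.e. an RRF just under 1000
print(common_cause / pfd)  # the common cause term supplies ~89% of the PFD
```

With these values the common cause term is roughly eight times the independent term, and the resulting PFD of about 1.1 × 10⁻³ leaves the function at the bottom of the SIL 2 range.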
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
Wrong but useful
The SINTEF Report A26922 includes a pertinent quote from George E. P. Box:

'Essentially, all models are wrong, but some are useful'

The quote appears in the context of a discussion of the different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

The application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
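Since the result is only good to an order of magnitude, the useful output of a PFD calculation is essentially the SIL band it falls in. A minimal sketch of that categorisation for demand-mode functions (the band edges follow the usual IEC 61511 tables; the helper name is our own):

```python
import math

def sil_band(pfd_avg: float) -> int:
    # Demand-mode bands: SIL n corresponds to 10^-(n+1) <= PFDavg < 10^-n
    if not 1e-5 <= pfd_avg < 1e-1:
        raise ValueError("PFDavg outside the SIL 1-4 range")
    return math.ceil(-math.log10(pfd_avg)) - 1

print(sil_band(1.1e-3))  # 2  (an RRF of ~900 falls in the SIL 2 band)
print(sil_band(5e-4))    # 3
```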
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during the operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
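One possible way to turn an assumed failure rate into an operational target of this kind (a sketch, not the method prescribed by the report: it simply treats the fleet failure count as Poisson-distributed with mean λ × aggregate device-years):

```python
import math

def poisson_cdf(k: int, mu: float) -> float:
    # P(N <= k) for a Poisson-distributed failure count with mean mu
    return sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k + 1))

def failures_exceed_target(observed: int, lam_target: float, device_years: float,
                           alpha: float = 0.05) -> bool:
    # Flag the observed count as an exceedance when it would be improbable
    # (probability < alpha) if the true rate were no higher than the target
    mu = lam_target * device_years
    p_at_least = 1.0 - poisson_cdf(observed - 1, mu)
    return p_at_least < alpha

# Hypothetical fleet: 40 valves assumed at 0.02 p.a., reviewed after 5 years
# (200 device-years, so 4 failures expected)
print(failures_exceed_target(9, 0.02, 200.0))  # True: investigate the causes
print(failures_exceed_target(6, 0.02, 200.0))  # False: consistent with target
```

A check of this kind distinguishes ordinary statistical scatter from a genuine exceedance of the design assumption, so that root cause analysis effort is directed where the benchmark is actually being missed.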
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates for some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise the initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both λDU and MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can be feasibly achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable, if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Typical values for T depend on application:
Batch process: T < 0.1 years
– results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
– low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT
SIL 2 needs 1oo2 final elements unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y)
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR
The values assumed in design set performance standards for availability and reliability in operation and maintenance
Achieving these performance standards depends on the design and on operation and maintenance practices
Monitoring performance is essential
Operators must monitor the failure rates and modes
• Calculate actual λDU, compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
– preventing systematic failures
• Volume of operating experience provides evidence that
– systematic capability is adequate
– inspection and maintenance is effective
– failures are monitored, analysed and understood
– failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature, and failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices
2. Designing the equipment to allow access to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
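These relationships can be sketched as follows. This is an illustrative outline under the paper's assumptions (constant λDU, proof-test interval T); the function names are ours, and the series combination is the usual small-PFD approximation of summing subsystem contributions.

```python
import math

def pfd_instantaneous(lam_du, t):
    """Probability that an undetected dangerous fault has accumulated by
    time t after a proof test (exponential law, lam_du in failures/year)."""
    return 1.0 - math.exp(-lam_du * t)

def pfd_avg_1oo1(lam_du, proof_test_interval):
    """Time-average PFD over a proof-test interval T for a single channel.
    Approximately lam_du * T / 2 when lam_du * T << 1."""
    lt = lam_du * proof_test_interval
    return 1.0 + (math.exp(-lt) - 1.0) / lt

def pfd_series(*subsystem_pfds):
    """Approximate PFD of subsystems in series (sensor + logic + final
    element): the function fails if any subsystem fails, so small PFDs add."""
    return sum(subsystem_pfds)

# lambda_DU = 0.02 pa with an annual proof test gives PFDavg of about 0.01
print(round(pfd_avg_1oo1(0.02, 1.0), 4))
```

The exponential average differs from λDU·T/2 only in the second order of λDU·T, which is why the linear approximation is used throughout the PFD formulas in the standards.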
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself, or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
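The estimate can be sketched numerically. This is an illustrative calculation, not taken from the standards: for a time-terminated observation with n failures in total time t, the upper confidence bound is λa = χ²(a; 2n+2) / (2t), and here a Wilson-Hilferty approximation stands in for an exact chi-squared quantile (a statistics library would give exact values).

```python
import math
from statistics import NormalDist

def chi2_ppf(p, df):
    """Wilson-Hilferty approximation to the chi-squared quantile function.
    Adequate for the moderate degrees of freedom used here."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * math.sqrt(a)) ** 3

def failure_rate_upper(n_failures, total_time, confidence):
    """Upper confidence bound on lambda from n failures observed over
    total_time (time-terminated): lambda_a = chi2(a; 2n+2) / (2 * total_time)."""
    return chi2_ppf(confidence, 2 * n_failures + 2) / (2.0 * total_time)

# Hypothetical dataset: 10 failures over 500 device-years of operation
lam70 = failure_rate_upper(10, 500.0, 0.70)
lam90 = failure_rate_upper(10, 500.0, 0.90)
print(round(lam90 / lam70, 2))  # the ratio depends only on the failure count
```

Re-running with n = 3 widens the ratio noticeably, which illustrates why the band narrows with each recorded failure.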
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
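As an illustration of how such an uncertainty band translates into a distribution, a lognormal can be fitted to a reported 5%-95% interval. This is a common modelling choice for failure rate data, not a method prescribed by OREDA, and the interval used below is hypothetical.

```python
import math

def lognormal_from_interval(lam5, lam95):
    """Fit a lognormal to a reported 90% uncertainty interval (5%..95%).
    With z_0.95 = 1.645: sigma = ln(lam95/lam5) / (2 * 1.645)."""
    sigma = math.log(lam95 / lam5) / (2.0 * 1.645)
    mu = math.log(math.sqrt(lam5 * lam95))     # log of the geometric midpoint
    mean = math.exp(mu + sigma**2 / 2.0)       # lognormal mean exceeds the median
    return mu, sigma, mean

# Interval spanning one order of magnitude, e.g. 0.005..0.05 failures pa
mu, sigma, mean = lognormal_from_interval(0.005, 0.05)
print(round(sigma, 2), round(mean, 3))
```

Note that the fitted mean sits above the geometric midpoint of the interval: a skewed uncertainty band pulls the mean rate upward, which is one reason single-point failure rates taken from such tables should be treated with caution.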
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed, constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent, if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β) (λDU T)² / 3 + β λDU T / 2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.
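Substituting these typical values into the 1oo2 equation shows the dominance numerically. This is a simple sketch of the beta-factor model described in the text; the function name is ours.

```python
def pfd_avg_1oo2(beta, lam_du, t):
    """Average PFD terms for a 1oo2 final-element pair (beta-factor model):
    independent term (1-beta)(lam_du*t)^2/3 plus common cause term beta*lam_du*t/2."""
    independent = (1.0 - beta) * (lam_du * t) ** 2 / 3.0
    common_cause = beta * lam_du * t / 2.0
    return independent, common_cause

# beta = 10%, lambda_DU = 0.02 pa, annual proof test (T = 1 year)
ind, ccf = pfd_avg_1oo2(0.10, 0.02, 1.0)
print(f"independent {ind:.1e}, common cause {ccf:.1e}")
# the ~1e-3 common cause term is roughly ten times the ~1e-4 independent term
```

With these values the common cause term contributes about 90% of the total PFD, so adding further redundancy would barely improve the result; only reducing β, λDU or T would.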
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
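The detected-failure contribution is a simple product, sketched below with illustrative values (the λDD and MTTR figures are hypothetical, chosen only to show the arithmetic):

```python
def pfd_detected(lam_dd, mttr_hours):
    """PFD contribution of detected dangerous failures: the function is
    unavailable for the restoration time after each detected fault.
    lam_dd is in failures per year; 8760 hours per year."""
    return lam_dd * (mttr_hours / 8760.0)

# lambda_DD = 0.1 pa with a 72 h mean time to restoration
print(f"{pfd_detected(0.1, 72.0):.1e}")
```

This is why a long MTTR, for example waiting for a planned shutdown to repair a detected fault, can erode the benefit of good diagnostic coverage.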
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision because the failure rates are not constant It is misleading to report calculated PFD with more than one significant figure of precision
Application of Markov models Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision The results are much more precise but no more accurate
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
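This order-of-magnitude view can be made concrete with a small helper that maps an average PFD to its SIL band and a one-significant-figure RRF. This is a sketch for illustration, not a function defined by the standard:

```python
import math

def sil_band(pfd_avg):
    """SIL category for a low-demand function: SIL n covers 10**-(n+1) <= PFD < 10**-n."""
    if not 1e-5 <= pfd_avg < 1e-1:
        return None                      # outside the SIL 1..4 low-demand bands
    return -math.floor(math.log10(pfd_avg)) - 1

def rrf_1sf(pfd_avg):
    """Risk reduction factor rounded to one significant figure."""
    rrf = 1.0 / pfd_avg
    return round(rrf, -int(math.floor(math.log10(rrf))))

# A calculated PFD of 3e-3 is best reported as: SIL 2 feasible, RRF roughly 300
print(sil_band(3e-3), rrf_1sf(3e-3))
```

Reporting any more digits than this would suggest a precision that the underlying failure rate data cannot support.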
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture) using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
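The effect of these parameters on what is achievable can be illustrated with the simplified low-demand formulas. The values are the typical figures quoted above; treat this as a sketch, not a verified SIL calculation:

```python
def pfd_1oo1(lam_du, T):
    """Average PFD of a single final element (simplified lam_du*T/2 approximation)."""
    return lam_du * T / 2

def pfd_1oo2(lam_du, T, beta):
    """Average PFD of a 1oo2 pair: independent term plus beta-factor common cause term."""
    return ((1 - beta) * lam_du * T) ** 2 / 3 + beta * lam_du * T / 2

# 1oo1 valve with lam_du = 0.02 pa, T = 1 y: PFD = 0.01, the very top of the SIL 2 band
print(pfd_1oo1(0.02, 1.0))
# 1oo2 with beta = 10 %: still short of SIL 3 (which needs PFD < 1e-3)
print(pfd_1oo2(0.02, 1.0, 0.10))
# Halving both beta and lam_du brings the SIL 3 band within reach
print(pfd_1oo2(0.01, 1.0, 0.05))
```

The third case shows why SIL 3 forces attention on β and λDU: redundancy alone cannot get past the common cause term.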
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
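One simple way to make such a comparison is to treat the design benchmark as a Poisson expectation and ask how surprising the observed failure count is. The sketch below uses only the standard library; the failure counts and device-years are made-up illustration values, not data from the report:

```python
import math

def poisson_sf(k, mean):
    """P(X >= k) for X ~ Poisson(mean): how likely k or more failures are
    if the failure rate assumed in the design is actually being achieved."""
    return 1.0 - sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))

# Design benchmark: lam_du = 0.02 pa across 120 valve-years -> 2.4 failures expected
expected = 0.02 * 120
observed = 7
p = poisson_sf(observed, expected)
if p < 0.05:
    print(f"{observed} DU failures vs {expected:.1f} expected (p={p:.3f}): "
          "analyse causes and take remedial action, not just more frequent testing")
```

A small p-value here signals that the benchmark failure rate is probably not being met, which is the trigger for the root cause analysis and compensating measures described above.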
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection testing and maintenance enables both the λDU and the MTTR to be minimised in operation
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the 'random' failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition that distinguishes random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail, and the effectiveness of the processes, techniques and measures, must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries — Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F. H. C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990
What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBF_DU > 50 y)
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa
• either λDU, β and/or T must be minimised
• needs close attention in design
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate the actual λDU and compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland #18312), Engineering Manager; A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates It draws conclusions on how failure rates and failure probability can be controlled in practice
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices
2. Designing the equipment to allow access to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently, at a fixed rate λDU, within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components
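The exponential accumulation described above, and the familiar λDU·T/2 average PFD that follows from it, can be sketched as follows (the rate and interval are illustrative values only):

```python
import math

lam_du, T = 0.02, 1.0   # example dangerous undetected rate (per year) and proof-test interval

# Fraction of the population carrying an accumulated undetected failure at time t
def pfd_at(t):
    return 1.0 - math.exp(-lam_du * t)

# Exact average over the proof-test interval, and the usual first-order approximation
pfd_avg_exact = 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)
pfd_avg_approx = lam_du * T / 2

print(pfd_avg_exact, pfd_avg_approx)   # nearly identical while lam_du * T << 1
```

The two values agree closely because λDU·T is small; the linear approximation is what makes the simple proportionalities used throughout this paper possible.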
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
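This behaviour can be checked numerically using the standard upper bound λa = χ²a(2n+2)/(2τ) for n failures in τ cumulative unit-hours. The sketch below uses a Wilson-Hilferty approximation to the χ² quantile so that only the standard library is needed; it is an approximation, not an exact χ² computation:

```python
import math
from statistics import NormalDist

def chi2_ppf(p, k):
    """Wilson-Hilferty approximation to the chi-squared quantile, k degrees of freedom."""
    z = NormalDist().inv_cdf(p)
    return k * (1 - 2 / (9 * k) + z * math.sqrt(2 / (9 * k))) ** 3

def lambda_upper(n_failures, tau, confidence):
    """Upper confidence bound on the failure rate from n failures in tau unit-hours."""
    return chi2_ppf(confidence, 2 * n_failures + 2) / (2 * tau)

# The ratio lambda_90 / lambda_70 depends only on the failure count: tau cancels out
for n in (3, 10, 30):
    ratio = lambda_upper(n, 1e6, 0.90) / lambda_upper(n, 1e6, 0.70)
    print(n, round(ratio, 2))   # the band narrows as failures accumulate
```

Running this shows the ratio shrinking towards 1 as the number of recorded failures grows, consistent with the statement above that 10 or more failures bring λ90 within roughly 20% of λ70.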
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F. H. C. Marriott[13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random Very few failures are purely random
Failures of mechanical components may seem to be random but the causes are usually well known and understood and are partially deterministic The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed
Degradation of mechanical components is not a purely random process Degradation can be monitored
While it might be theoretically possible to prevent failure from degradation it is simply not practicable to prevent all failures Inspection and maintenance can never be perfect It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance overhaul and renewal They are characterised by a constant failure rate even though that rate is not fixed
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 6
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (ie 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.
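The dominance of the common cause term can be checked numerically. The sketch below (plain Python; the function name and the chosen β values are illustrative assumptions, not from the text) evaluates the two terms of the 1oo2 approximation given above:

```python
def pfd_avg_1oo2(lambda_du, T, beta):
    """Average PFD of a 1oo2 final-element pair (rare-event approximation).

    lambda_du : dangerous undetected failure rate (per year)
    T         : proof test interval (years)
    beta      : common cause factor (fraction, e.g. 0.02 for 2 %)
    Returns the independent-failure term and the common cause term.
    """
    independent = (1 - beta) * (lambda_du * T) ** 2 / 3
    common_cause = beta * lambda_du * T / 2
    return independent, common_cause

# Example values from the text: lambda_DU = 0.02 pa, T = 1 year.
# At beta = 2 % the two terms are already comparable; at the more
# realistic 12 % the common cause term dominates the total PFD.
ind, ccf = pfd_avg_1oo2(0.02, 1.0, 0.02)
ind_hi, ccf_hi = pfd_avg_1oo2(0.02, 1.0, 0.12)
```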
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 7
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful.'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficient to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
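Because the estimate is only good to an order of magnitude, the practical output of a PFD calculation is a SIL band rather than a precise risk reduction factor. A minimal sketch of that categorisation, using the standard IEC 61508/61511 low-demand bands (SIL n corresponds to 10^-(n+1) ≤ PFD < 10^-n):

```python
import math

def sil_category(pfd_avg):
    """Map an average PFD (low-demand mode) to its SIL band, or None
    if the PFD falls outside the SIL 1 to SIL 4 ranges."""
    if not 1e-5 <= pfd_avg < 1e-1:
        return None
    # SIL n <=> 10^-(n+1) <= PFD < 10^-n
    return -math.floor(math.log10(pfd_avg)) - 1

# The RRF is simply 1/PFD, reported to at most one significant figure.
```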
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (ie 1oo1 architecture) using either a valve or contactor
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
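This follow-up approach can be sketched as a simple benchmark comparison (illustrative only; the function and parameter names are assumptions, not taken from the SINTEF report):

```python
def failure_benchmark(lambda_target, n_devices, years, observed_failures):
    """Compare observed failures against the number expected from the
    failure rate assumed in design (lambda_target, per device-year).

    Returns the expected count and whether the target was exceeded,
    which would trigger root cause analysis and compensating measures.
    """
    expected = lambda_target * n_devices * years
    return expected, observed_failures > expected

# e.g. 50 valves at a target of 0.02 pa over 2 years -> 2 expected
# failures; 5 observed failures would exceed the benchmark.
expected, exceeded = failure_benchmark(0.02, 50, 2, 5)
```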
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 9
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection testing and maintenance enables both the λDU and the MTTR to be minimised in operation
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing lsquorandomrsquo failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.
Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate the actual λDU and compare it with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, ie systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer
IampE Systems Pty Ltd
Perth Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates It draws conclusions on how failure rates and failure probability can be controlled in practice
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and on failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that the failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant.
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices
Most failures in safety function components (including software) are predictable preventable or avoidable to some degree suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1 Eliminating systematic and common-cause failures throughout the design development and implementation processes and throughout operation and maintenance practices
2 Designing the equipment to allow access to enable sufficiently frequent inspection testing and maintenance and to enable suitable test coverage
3 Using risk-based inspection and condition-based maintenance techniques to
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4 Detailed analysis of all failures to
- Monitor failure rates
- Determine how failure recurrence may be prevented if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (ie a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (eg Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
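For a single component this accumulation is PFD(t) = 1 − e^(−λDU·t), and its time average over a proof test interval T reduces to the familiar approximation λDU·T/2 when λDU·T ≪ 1. A short numerical sketch (illustrative Python, function names assumed):

```python
import math

def pfd_instant(lambda_du, t):
    # Fraction of an unrepaired population that has failed by time t
    return 1 - math.exp(-lambda_du * t)

def pfd_average(lambda_du, T, steps=10000):
    # Numerical time-average of PFD(t) over one proof test interval,
    # using the midpoint rule
    dt = T / steps
    return sum(pfd_instant(lambda_du, (i + 0.5) * dt) for i in range(steps)) / steps

# For lambda_DU = 0.02 pa and T = 1 year the average is within 1 % of
# the approximation lambda_DU * T / 2 = 0.01.
```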
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (ie redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to an increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
Presented at the 13th International TÜV Rheinland Symposium, May 15–16, 2018, Cologne, Germany
The ratio of λ90 to λ70 depends only on the number of failures recorded; it does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
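The χ² bound can be computed equivalently through the Poisson distribution (χ² with 2n+2 degrees of freedom at confidence a, divided by 2T). A sketch using only the standard library; the failure count and operating hours are made-up illustration values:

```python
import math

def poisson_cdf(n: int, mu: float) -> float:
    """P(N <= n) for a Poisson random variable with mean mu."""
    return math.exp(-mu) * sum(mu ** k / math.factorial(k) for k in range(n + 1))

def lambda_upper(failures: int, hours: float, confidence: float) -> float:
    """Upper confidence bound on the failure rate: the smallest rate at which
    observing `failures` or fewer events in `hours` has probability
    1 - confidence.  Equivalent to chi2.ppf(confidence, 2n + 2) / (2 * hours)."""
    lo, hi = 0.0, 100.0 * (failures + 1) / hours
    for _ in range(200):  # bisection on the monotone Poisson CDF
        mid = 0.5 * (lo + hi)
        if poisson_cdf(failures, mid * hours) > 1.0 - confidence:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Made-up record: 10 failures in 5e6 device-hours.
l70 = lambda_upper(10, 5e6, 0.70)
l90 = lambda_upper(10, 5e6, 0.90)
print(l90 / l70)  # the ratio depends only on the failure count, not on hours
```

With 10 recorded failures the ratio comes out a little above 1.2, and it grows as the failure count shrinks, matching the text's observation.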
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard.'

ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)(λDU T)² / 3 + β λDU T / 2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
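Substituting the example values into the two-term approximation shows the domination directly. A sketch; β = 10% is an assumed illustrative value, not a figure from the text:

```python
def pfd_1oo2(lam_du: float, t_test: float, beta: float):
    """Two terms of the 1oo2 approximation: independent failures
    and common cause failures."""
    independent = (1.0 - beta) * (lam_du * t_test) ** 2 / 3.0
    common_cause = beta * lam_du * t_test / 2.0
    return independent, common_cause

# lam_du = 0.02 pa, T = 1 year, assumed beta = 10%.
ind, ccf = pfd_1oo2(0.02, 1.0, 0.10)
print(ind, ccf)  # ~1.2e-04 vs 1.0e-03: the CCF term is ~8x larger
```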
The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
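The detected-failure contribution is simply the average unavailability while awaiting repair; a sketch with hypothetical numbers:

```python
HOURS_PER_YEAR = 8760.0

def pfd_detected(lam_dd_per_year: float, mttr_hours: float) -> float:
    """Average unavailability due to detected failures: the failure rate
    multiplied by the fraction of the year spent under restoration."""
    return lam_dd_per_year * mttr_hours / HOURS_PER_YEAR

# Hypothetical values: 0.2 detected failures per annum, 24 h mean time to restore.
print(pfd_detected(0.2, 24.0))  # ~5.5e-04
```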
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
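That order-of-magnitude limit is easy to make concrete: round the PFD to one significant figure and read off the SIL band. A sketch of the demand-mode banding (the function names are my own):

```python
import math

def round_1sf(x: float) -> float:
    """Round to one significant figure -- all the precision a PFD warrants."""
    exp = math.floor(math.log10(abs(x)))
    return round(x / 10 ** exp) * 10 ** exp

def sil_band(pfd: float):
    """Demand-mode SIL band per IEC 61508/61511: SIL n covers
    10**-(n+1) <= PFD < 10**-n.  None if the PFD is too high even for SIL 1."""
    if pfd >= 1e-1:
        return None
    if pfd < 1e-5:
        return 4
    return -math.floor(math.log10(pfd)) - 1

print(round_1sf(0.00123), sil_band(0.00123))  # 0.001 -> SIL 2
```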
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
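A sketch of such a follow-up check: compare the observed failure count against the benchmark expectation, and flag counts that would be improbable if the design failure rate were actually being achieved. The plant numbers are invented, and the 95% alarm threshold is my own choice rather than a figure from the SINTEF guideline:

```python
import math

def expected_failures(lam_target: float, n_devices: int, years: float) -> float:
    """Benchmark: expected failure count for a population of devices."""
    return lam_target * n_devices * years

def exceeds_benchmark(observed: int, expected: float, p_alarm: float = 0.95) -> bool:
    """True if `observed` lies above the p_alarm quantile of a Poisson
    distribution with the benchmark mean -- unlikely to be mere bad luck,
    so the causes of the failures should be analysed."""
    cdf, k = 0.0, 0
    while cdf < p_alarm:
        cdf += math.exp(-expected) * expected ** k / math.factorial(k)
        k += 1
    return observed > k - 1

# Invented plant: 50 valves, benchmark rate 0.02 pa, 2 years of records.
mu = expected_failures(0.02, 50, 2.0)  # 2.0 expected failures
print(exceeds_benchmark(6, mu))        # True  -> investigate causes
print(exceeds_benchmark(3, mu))        # False -> within expectation
```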
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions

Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable, if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries — Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] 'OREDA Offshore and Onshore Reliability Data Handbook', Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] 'Safety Equipment Reliability Handbook', 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Monitoring performance is essential

Operators must monitor the failure rates and modes:

• Calculate the actual λDU and compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:

• Safety functions deliver zero RRF while bypassed
Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:

• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:

• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components

Certification and prior use are also important:

• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation

Prevent systematic failures through quality:

• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:

• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:

• reduce failure rates that exceed benchmarks
Credible reliability data
Performance targets must be feasible by design
≡ feasible performance targets
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and on failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that the failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, the operating environment and the maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
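The exponential accumulation of undetected faults described above can be sketched numerically. This is a minimal illustration of the constant-rate model, not a substitute for the methods in the references; the failure rate and test interval values are assumed for the example:

```python
import math

def pfd_instantaneous(lam_du, t):
    """Probability that a component has failed undetected by time t
    since its last proof test, assuming a constant rate lam_du."""
    return 1.0 - math.exp(-lam_du * t)

def pfd_average(lam_du, test_interval):
    """Exact mean of pfd_instantaneous over the proof-test interval;
    for lam_du * T << 1 this approaches the familiar lam_du * T / 2."""
    x = lam_du * test_interval
    return 1.0 - (1.0 - math.exp(-x)) / x

lam_du = 0.02   # assumed: 0.02 dangerous undetected failures per annum
T = 1.0         # assumed: annual proof test
print(round(pfd_average(lam_du, T), 5))   # close to lam_du * T / 2 = 0.01
```

The point of the sketch is that under the constant-rate assumption the average PFD is driven entirely by λDU and the proof-test interval T.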
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
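The χ² estimate can be sketched as follows, here using scipy. The cumulative operating time is an assumed illustrative value; as stated above, the λ90/λ70 ratio depends only on the number of recorded failures (a time-terminated test with 2n + 2 degrees of freedom is assumed):

```python
from scipy.stats import chi2

def failure_rate_upper(n_failures, total_time_hours, confidence):
    """Upper confidence bound on the failure rate after observing
    n_failures over a cumulative operating time (time-terminated test)."""
    dof = 2 * n_failures + 2
    return chi2.ppf(confidence, dof) / (2.0 * total_time_hours)

T = 1.0e6  # assumed cumulative device-hours; the ratio below is independent of it
for n in (3, 10):
    ratio = failure_rate_upper(n, T, 0.90) / failure_rate_upper(n, T, 0.70)
    print(n, round(ratio, 2))
```

With 3 recorded failures λ90 is roughly 40% above λ70; with 10 or more failures the gap shrinks towards the 20% figure quoted above.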
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard.'
ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
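The contrast between constant-rate failure and degradation-driven failure can be illustrated with hazard rates. The sketch below compares an exponential (constant-rate) model with a Weibull wear-out model; the shape and scale values are illustrative assumptions, not data from the references:

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure (hazard) rate of a Weibull model.
    shape == 1 reproduces the constant-rate exponential model;
    shape > 1 models wear-out, with a rate that rises with age."""
    return (shape / scale) * (t / scale) ** (shape - 1)

scale = 50.0   # assumed characteristic life, years
for t in (5, 20, 40):
    print(t,
          round(weibull_hazard(t, 1.0, scale), 4),   # constant rate
          round(weibull_hazard(t, 3.0, scale), 4))   # rising wear-out rate
```

A rising hazard rate is exactly what inspection and condition monitoring can detect; treating such failures as quasi-random with a constant rate is a modelling convenience, as noted above.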
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)(λDU T)²/3 + β λDU T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
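The dominance of the common cause term in the simplified 1oo2 formula can be checked numerically. The β values below are illustrative: 2% as a hard-to-achieve lower bound and 12% as a typical achieved value per the SINTEF report:

```python
def pfd_1oo2(lam_du, test_interval, beta):
    """Independent and common cause terms of the simplified
    1oo2 average-PFD formula used in the text."""
    independent = (1.0 - beta) * (lam_du * test_interval) ** 2 / 3.0
    common_cause = beta * lam_du * test_interval / 2.0
    return independent, common_cause

lam_du, T = 0.02, 1.0          # assumed: 0.02 pa, annual proof test
for beta in (0.02, 0.12):
    ind, ccf = pfd_1oo2(lam_du, T, beta)
    print(f"beta={beta:.0%}  independent={ind:.2e}  common cause={ccf:.2e}")
```

Even at β = 2% the common cause term already exceeds the independent term; at β = 12% it contributes over 90% of the total PFD.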
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful.'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice.
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
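The order-of-magnitude categorisation described above can be sketched as a simple mapping from a PFD estimate to its low-demand SIL band, with the risk reduction factor rounded to one significant figure (the helper names and example values are my own, for illustration):

```python
import math

def sil_band(pfd_avg):
    """Map a low-demand average PFD to its SIL band:
    SIL n corresponds to 10**-(n+1) <= PFD < 10**-n."""
    if not 1e-5 <= pfd_avg < 1e-1:
        return None  # outside the SIL 1..4 bands
    return -int(math.floor(math.log10(pfd_avg))) - 1

def rrf_one_sig_fig(pfd_avg):
    """Risk reduction factor reported to a single significant figure,
    reflecting the order-of-magnitude precision of the estimate."""
    rrf = 1.0 / pfd_avg
    magnitude = 10.0 ** math.floor(math.log10(rrf))
    return round(rrf / magnitude) * magnitude

print(sil_band(0.005), rrf_one_sig_fig(0.0047))
```

Reporting a calculated PFD of, say, 0.0047 as anything more precise than "RRF ≈ 200, SIL 2 feasible" would overstate the accuracy of the underlying failure rates.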
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
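The feasibility argument above can be made concrete with the simplified 1oo1 approximation PFDavg ≈ λDU T/2, using the typical rates quoted in this section (the specific combinations shown are illustrative assumptions):

```python
def pfd_1oo1(lam_du, test_interval):
    """Simplified average PFD of a single (1oo1) final element: lam_du * T / 2."""
    return lam_du * test_interval / 2.0

# assumed typical rates from the text: valve 0.02 pa, contactor 0.01 pa
valve = pfd_1oo1(0.02, 1.0)           # right on the SIL 2 / SIL 1 boundary
contactor = pfd_1oo1(0.01, 1.0)       # within the SIL 2 band
valve_6monthly = pfd_1oo1(0.02, 0.5)  # halving T halves the PFD
print(valve, contactor, valve_6monthly)
```

A valve tested annually sits right at PFD = 0.01, the top of the SIL 2 band, which is why a lower λDU, a shorter test interval or a 1oo2 architecture is needed to achieve SIL 2 with margin.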
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
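One simple way to compare recorded failures against the design benchmark is a Poisson check: if the benchmark rate were truly being achieved, how surprising is the observed count? This is a sketch of the idea, not a procedure prescribed by the SINTEF report; the device count and observed failures are assumed example values:

```python
import math

def poisson_sf(k, mu):
    """P(N >= k) for N ~ Poisson(mu): the chance of seeing k or more
    failures if the benchmark failure rate were really being achieved."""
    cdf = sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

# assumed example: 50 valves, benchmark 0.02 failures per valve-year, 2 years
expected = 0.02 * 50 * 2   # 2.0 expected dangerous undetected failures
observed = 7               # assumed count from maintenance records
p = poisson_sf(observed, expected)
print(round(p, 4))         # a small value signals rates above the benchmark
```

A small probability (here well under 1%) indicates that the observed failures are inconsistent with the benchmark rate, triggering the cause analysis and compensating measures described above.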
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing lsquorandomrsquo failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail, and the effectiveness of the processes, techniques and measures, must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries — Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990
Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design
Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and on failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, the operating environment and the maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only to within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Analysing all failures in detail to:
- Monitor failure rates
- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently, at a fixed rate λDU, within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
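This accumulation can be sketched with a short calculation. The exponential model below is the standard one implied by a constant rate; the λDU value of 0.02 pa is an illustrative figure (the same one used in the example later in this paper), not data:

```python
import math

def pfd_point(lam_du, t):
    # Probability that a component has suffered an undetected dangerous
    # failure by time t since its last proof test: 1 - exp(-lambda*t)
    return 1.0 - math.exp(-lam_du * t)

def pfd_avg(lam_du, T, steps=10_000):
    # Time-average of the point PFD over a proof-test interval T,
    # evaluated numerically; for small lambda*T this approaches lambda*T/2
    return sum(pfd_point(lam_du, (i + 0.5) * T / steps) for i in range(steps)) / steps

# Illustrative figures: lambda_DU = 0.02 per annum, proof test every year
avg = pfd_avg(0.02, 1.0)  # close to lambda*T/2 = 0.01
```

The numerical average confirms the familiar λDU·T/2 approximation used for single components tested at interval T.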
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
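As an illustration, the χ² estimate can be computed without any statistics library, because the degrees of freedom are always even; the 2(n+1) degrees-of-freedom convention below is the usual one-sided upper bound for time-truncated data, and this is a sketch rather than a prescribed method:

```python
import math

def chi2_cdf_even(x, dof):
    # Chi-squared CDF for even degrees of freedom 2m, using the
    # closed-form Poisson-sum identity (no external library needed)
    m = dof // 2
    s = sum((x / 2.0) ** k / math.factorial(k) for k in range(m))
    return 1.0 - math.exp(-x / 2.0) * s

def chi2_ppf_even(p, dof):
    # Invert the CDF by bisection
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def failure_rate_bound(n_failures, total_time, confidence):
    # Upper confidence bound on lambda from n failures observed over
    # total_time of accumulated operation (time-truncated convention)
    dof = 2 * (n_failures + 1)
    return chi2_ppf_even(confidence, dof) / (2.0 * total_time)

# With 10 recorded failures, lambda_90 is only ~20-25% above lambda_70,
# regardless of the population size or the rate itself:
ratio = failure_rate_bound(10, 1e6, 0.90) / failure_rate_bound(10, 1e6, 0.70)
```

Note that the total operating time cancels out of the ratio, which is why the uncertainty band depends only on the number of failures recorded.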
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process: degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are then characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause, which will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours)[8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β) (λDU T)² / 3 + β λDU T / 2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
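A minimal sketch of these contributions, using the example figures above (λDU = 0.02 pa, T = 1 year); the β of 12% and the detected-failure figures are assumptions for illustration, not data from the cited sources:

```python
def pfd_1oo2_undetected(lam_du, T, beta):
    # PFDavg ~ (1-beta)*(lam_du*T)^2/3 + beta*lam_du*T/2
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    return independent, common_cause

def pfd_detected(lam_dd, mttr_years):
    # Contribution of detected failures: unavailability while awaiting repair
    return lam_dd * mttr_years

# Example figures: lam_DU = 0.02 pa, T = 1 year; beta = 0.12 assumed
ind, ccf = pfd_1oo2_undetected(0.02, 1.0, 0.12)
# ccf (~1.2e-3) is roughly ten times ind (~1.2e-4): common cause dominates
total = ind + ccf + pfd_detected(0.002, 8.0 / 8760.0)  # assumed lam_DD, 8 h MTTR
```

With these numbers the common cause term alone already limits the subsystem to roughly the SIL 2/SIL 3 boundary, which is the point of the discussion above.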
Wrong but useful
The SINTEF Report A26922 includes a pertinent quote from George E. P. Box:
'Essentially, all models are wrong, but some are useful'
The quote appears in the context of a discussion of the different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
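The order-of-magnitude discipline described above might be sketched as follows; the helper functions are ours, and the bands are the demand-mode PFDavg ranges of IEC 61511 (SIL n covers 10^-(n+1) ≤ PFDavg < 10^-n):

```python
import math

def round_to_one_sig_fig(x):
    # Report PFD with a single significant figure, e.g. 0.001317 -> 0.001
    exponent = math.floor(math.log10(abs(x)))
    return round(x, -exponent)

def sil_band(pfd_avg):
    # Demand-mode SIL bands: SIL n corresponds to 10^-(n+1) <= PFDavg < 10^-n
    if 1e-2 <= pfd_avg < 1e-1:
        return 1
    if 1e-3 <= pfd_avg < 1e-2:
        return 2
    if 1e-4 <= pfd_avg < 1e-3:
        return 3
    if 1e-5 <= pfd_avg < 1e-4:
        return 4
    return None  # outside the SIL 1-4 demand-mode ranges
```

For example, a calculated PFDavg of 0.001317 would be reported as 0.001 and categorised as SIL 2 capability; reporting further digits would overstate the precision.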
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. with a 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with a 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants: they are all under the control of the designers and the operations and maintenance team.
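To illustrate how these targets interact, the 1oo2 approximation can be inverted to find the largest β that still meets a PFD target. This is our back-of-envelope sketch using the paper's example figures, not a design rule:

```python
def max_beta_for_target(pfd_target, lam_du, T):
    # Solve (1-beta)*(lam_du*T)^2/3 + beta*lam_du*T/2 <= pfd_target for beta
    x = lam_du * T
    independent = x ** 2 / 3.0
    return (pfd_target - independent) / (x / 2.0 - independent)

# lam_DU = 0.02 pa, T = 1 year, target PFDavg = 1e-3 (minimum SIL 3, RRF 1000):
beta_max = max_beta_for_target(1e-3, 0.02, 1.0)
# beta must be held below roughly 9% - hence the emphasis on independence,
# or on shortening T / reducing lam_DU if that beta cannot be achieved
```

Shortening the proof-test interval relaxes the β requirement, which is why T appears alongside β and λDU in the set of operational targets.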
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
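One simple way to implement such a target-value check (our sketch, not the method prescribed in the report) is to compare the observed failure count with the count expected from the design failure rate, treating failures as Poisson events; all figures below are hypothetical:

```python
import math

def expected_failures(lam_target, n_devices, years):
    # Expected number of failures if the design failure rate is being achieved
    return lam_target * n_devices * years

def prob_at_least(k, mu):
    # P(N >= k) for N ~ Poisson(mu): how surprising k observed failures are
    # if the target rate were really being met
    return 1.0 - sum(math.exp(-mu) * mu ** i / math.factorial(i) for i in range(k))

# Hypothetical example: 50 valves, target lambda_DU = 0.02 pa, 2 years in service
mu = expected_failures(0.02, 50, 2)   # 2 expected failures
p = prob_at_least(6, mu)              # 6 failures actually recorded
# A small p (here well under 5%) says the target rate is probably being
# exceeded, so the causes of the failures need to be analysed
```

A check like this flags when the measured rate is genuinely above the benchmark rather than a run of bad luck, which is the trigger for the compensating measures described above.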
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates for some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to the management of random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. The OREDA statistics clearly show that failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and if inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction: SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures, and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)
Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks
Credible reliability data
Performance targets must be feasible by design
≡ feasible performance targets
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring increased confidence levels, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature, and failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
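The accumulation of undetected failures under these assumptions can be sketched in a few lines. This is an illustrative calculation, not one given in the paper; the helper names are hypothetical, and the formulas are the standard constant-failure-rate (exponential) approximations:

```python
import math

def pfd_instantaneous(lam_du: float, t: float) -> float:
    """Probability that a single component has failed undetected by time t,
    assuming a fixed, constant failure rate (exponential model)."""
    return 1.0 - math.exp(-lam_du * t)

def pfd_avg_1oo1(lam_du: float, T: float) -> float:
    """Exact average of 1 - exp(-lam*t) over a proof test interval T.
    For small lam*T this reduces to the familiar lam*T/2 approximation."""
    lam_t = lam_du * T
    return 1.0 - (1.0 - math.exp(-lam_t)) / lam_t

lam_du = 0.02   # dangerous undetected failures per annum (typical actuated valve)
T = 1.0         # proof test interval, years
print(round(pfd_avg_1oo1(lam_du, T), 4))   # 0.0099, close to lam_du*T/2 = 0.01
```

The exact average and the λDU·T/2 approximation agree to well within the order-of-magnitude precision the paper argues is all that can be claimed.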
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason, the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself, or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
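The narrowing of the confidence band with each recorded failure can be demonstrated numerically. The sketch below is an illustration, not a method from the paper: it uses the standard χ² upper-confidence-limit formula for time-truncated data, with the Wilson-Hilferty approximation for the χ² quantile so that only the standard library is needed; `lambda_estimate` is a hypothetical helper name, and the exact ratios depend on the censoring assumptions:

```python
import math
from statistics import NormalDist

def chi2_quantile(p: float, k: int) -> float:
    """Wilson-Hilferty approximation to the chi-squared p-quantile
    with k degrees of freedom (adequate for k >= 4)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * k)
    return k * (1.0 - c + z * math.sqrt(c)) ** 3

def lambda_estimate(n_failures: int, total_time: float, confidence: float) -> float:
    """One-sided upper confidence limit on the failure rate, from n failures
    observed over total_time (time-truncated data): chi2(conf, 2n+2) / (2*T)."""
    return chi2_quantile(confidence, 2 * n_failures + 2) / (2.0 * total_time)

# The ratio lambda90/lambda70 depends only on the number of recorded failures:
for n in (3, 10):
    ratio = lambda_estimate(n, 1.0, 0.90) / lambda_estimate(n, 1.0, 0.70)
    print(f"{n} failures: lambda90/lambda70 = {ratio:.2f}")  # ratio shrinks with n
```

The ratio is independent of the total observation time, confirming the point in the text that it depends only on the number of failures recorded.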
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed, constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented, to some extent, if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector, the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours)[8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 7
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
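The dominance of the common cause term can be checked numerically with the typical values quoted above. This is an illustrative sketch (hypothetical helper name, figures rounded), using the simplified 1oo2 approximation from the example:

```python
def pfd_avg_1oo2(lam_du: float, T: float, beta: float) -> float:
    """Simplified average PFD for a 1oo2 architecture (undetected failures):
    an independent-failure term plus a common cause term."""
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    return independent + common_cause

lam_du, T = 0.02, 1.0   # typical valve rate (per annum) and 1-year proof test interval
for beta in (0.02, 0.12):
    total = pfd_avg_1oo2(lam_du, T, beta)
    ccf = beta * lam_du * T / 2.0
    print(f"beta={beta:.0%}: PFD={total:.1e}, common cause share={ccf / total:.0%}")
```

With β = 12% the common cause term contributes roughly 90% of the PFD and the resulting RRF (about 760) falls short of the 1000 needed for SIL 3; only by driving β down towards 2% does the independent term become comparable.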
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice.
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants: they are all under the control of the designers and the operations and maintenance team.
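This feasibility screening can be sketched as a simple categorisation. The helper below is hypothetical (not from the paper); it uses the demand-mode PFD bands from IEC 61511, and the example PFD values correspond to the worked example in the text (λDU = 0.02 pa, T = 1 year):

```python
def sil_band(pfd_avg: float) -> int:
    """Highest SIL whose demand-mode PFD band contains pfd_avg
    (SIL n requires PFD < 10**(-n)); 0 means below SIL 1."""
    if pfd_avg < 1e-4:
        return 4
    if pfd_avg < 1e-3:
        return 3
    if pfd_avg < 1e-2:
        return 2
    if pfd_avg < 1e-1:
        return 1
    return 0

print(sil_band(1.0e-2))  # 1: a 1oo1 valve at PFD = lam_DU*T/2 sits on the SIL 1/2 boundary
print(sil_band(1.3e-3))  # 2: 1oo2 valves with beta = 12% do not reach SIL 3
print(sil_band(3.3e-4))  # 3: with beta reduced to 2%, SIL 3 becomes feasible
```

The order-of-magnitude banding mirrors the paper's point: the calculation cannot justify more than one significant figure, but it is precise enough to decide which SIL is feasible.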
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
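Comparing observed failures against the design benchmark can be sketched with a simple Poisson check. This is an illustrative model with made-up numbers, not a method prescribed by the report or the standard:

```python
import math

def poisson_sf(observed: int, expected: float) -> float:
    """P(N >= observed) for N ~ Poisson(expected): the probability of seeing
    at least this many failures if the benchmark rate were actually being met."""
    cdf = sum(math.exp(-expected) * expected ** k / math.factorial(k)
              for k in range(observed))
    return 1.0 - cdf

# Hypothetical benchmark from the design: lam_DU = 0.02 failures per device-year,
# monitored over a population of 25 valves for 2 years.
expected_failures = 0.02 * 25 * 2      # 1.0 failure expected
observed_failures = 4
p = poisson_sf(observed_failures, expected_failures)
print(f"{p:.3f}")   # 0.019: improbable if the benchmark were met -> analyse the causes
```

A small tail probability flags that the benchmark rate is probably being exceeded, which is the trigger in the text for root cause analysis and compensating measures rather than simply testing more often.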
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates for some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied, with sufficient effectiveness, to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

Presented at the 13th International TÜV Rheinland Symposium, May 15–16 2018, Cologne, Germany
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the 'random' failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude: the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F. H. C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and on failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, the operating environment and the maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than
purely random. Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD of safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with catalectic verses (i.e. verses with seven feet instead of eight), which stop abruptly. A catalectic failure therefore occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. the Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFDs of the individual components.
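The constant-rate assumption can be illustrated with a short calculation. The sketch below is illustrative Python, not part of the paper: it shows the exponential accumulation of undetected failures for a single component and the resulting average PFD over a proof-test interval T, assuming a fixed λDU and a perfect proof test. For small λDU·T the average reduces to the familiar λDU·T/2.

```python
import math

def pfd_instantaneous(lam_du: float, t: float) -> float:
    """Fraction of a large population failed undetected by time t,
    assuming a fixed constant failure rate (exponential law)."""
    return 1.0 - math.exp(-lam_du * t)

def pfd_avg_1oo1(lam_du: float, test_interval: float) -> float:
    """Average PFD over a proof-test interval, assuming a perfect test.
    Exact time-average of 1 - exp(-lam*t); ~ lam*T/2 for small lam*T."""
    lam_t = lam_du * test_interval
    return 1.0 - (1.0 - math.exp(-lam_t)) / lam_t

lam_du = 0.02   # dangerous undetected failures per annum (typical valve value used later)
T = 1.0         # proof-test interval in years

print(pfd_instantaneous(lam_du, T))  # ~0.0198: about 2 % failed by the end of the interval
print(pfd_avg_1oo1(lam_du, T))       # ~0.0099, close to lam_du*T/2 = 0.01
```

Given the order-of-magnitude uncertainty in λDU discussed below, the simple λDU·T/2 approximation is entirely adequate in practice.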
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to an increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90 % ('Route 2H')

A confidence level of 90 % effectively means that there is only a 10 % chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70 %. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself, or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20 % higher than λ70.
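This behaviour of the χ² estimate can be sketched in a few lines of standard-library Python. The helper below is illustrative only: it approximates the χ² quantile with the Wilson–Hilferty formula (so no statistics package is needed) and assumes the common time-truncated estimator λa = χ²a(2n + 2) / (2 × total operating hours); the function names and the dataset are hypothetical.

```python
import math
from statistics import NormalDist

def chi2_quantile(p: float, df: int) -> float:
    """Wilson-Hilferty approximation to the chi-squared p-quantile."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3

def failure_rate(n_failures: int, total_hours: float, confidence: float) -> float:
    """Failure rate estimated at the given confidence level
    (time-truncated estimator: lambda_a = chi2_a(2n+2) / (2*T_total))."""
    return chi2_quantile(confidence, 2 * (n_failures + 1)) / (2.0 * total_hours)

# Hypothetical dataset: 10 failures observed over 5 million device-hours
l70 = failure_rate(10, 5e6, 0.70)
l90 = failure_rate(10, 5e6, 0.90)
print(l90 / l70)  # ~1.24 with 10 recorded failures

# The ratio depends only on the failure count, not on the operating time
ratio_other = failure_rate(10, 1e4, 0.90) / failure_rate(10, 1e4, 0.70)
assert abs(l90 / l70 - ratio_other) < 1e-9
```

With only 3 failures the same ratio is noticeably wider, which is consistent with the narrowing of the uncertainty band as failures are recorded.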
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90 % uncertainty interval for the reported failure rates. This is the band stretching from the 5 % certainty level to the 95 % certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word 'random' is narrower. For the mathematical analysis, these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed, constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F. H. C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent, if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process; degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can be predicted with any reasonable accuracy only within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random, because they are not independent events. These failures are largely systematic, and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1-out-of-2) redundant architecture. A typical rate of undetected dangerous failures λDU for an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β) (λDU T)² / 3 + β λDU T / 2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2 %, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2 %.
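The dominance of the common cause term is easy to check numerically. The following sketch (illustrative Python, using the example figures from the text) evaluates the two terms of the approximate 1oo2 formula for two values of β:

```python
def pfd_terms_1oo2(lam_du: float, T: float, beta: float):
    """The two terms of the approximate 1oo2 PFD formula:
    independent term (1-beta)(lam*T)^2/3 and common cause term beta*lam*T/2."""
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    return independent, common_cause

lam_du, T = 0.02, 1.0        # per annum; 1-year proof-test interval
for beta in (0.02, 0.10):    # 2 %, and a more typical 10 %
    ind, ccf = pfd_terms_1oo2(lam_du, T, beta)
    print(f"beta={beta:.0%}: independent={ind:.1e}, common cause={ccf:.1e}")
```

At β = 10 % the common cause term (1.0e-03) is roughly eight times the independent term (1.2e-04), so the achievable risk reduction is set almost entirely by β rather than by the redundancy.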
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10 %. Typical values achieved are in the range 12 % to 15 %.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
Wrong but useful
The SINTEF Report A26922 includes a pertinent quote from George E. P. Box:

'Essentially, all models are wrong, but some are useful'

The quote appears in the context of a discussion of the different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report a calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
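This order-of-magnitude reasoning is exactly how an average PFD maps to a SIL band in low-demand mode, where SIL n corresponds to 10^-(n+1) ≤ PFDavg < 10^-n. A minimal sketch, with an illustrative helper name:

```python
import math

def sil_band(pfd_avg: float):
    """Demand-mode SIL band for an average PFD:
    SIL n corresponds to 10**-(n+1) <= PFD < 10**-n."""
    if not 1e-5 <= pfd_avg < 1e-1:
        return None                      # outside the SIL 1-4 bands
    return -math.floor(math.log10(pfd_avg)) - 1

print(sil_band(4e-3))   # 2  (RRF 250: SIL 2 range)
print(sil_band(5e-4))   # 3  (RRF 2000: SIL 3 range)
```

Only the exponent of the PFD survives the uncertainty in the inputs, which is why the SIL category is a more honest statement of performance than the calculated PFD itself.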
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. with a 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with a 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
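The follow-up comparison suggested above can be sketched as a simple Poisson check: compare the number of failures actually observed against the number expected from the failure rates assumed in the design. This is an illustrative sketch only; the function name and the example figures are ours, not taken from the SINTEF report.

```python
import math

def failure_budget(lambda_assumed, n_devices, years, observed):
    """Compare observed failures against the count expected from the
    design failure rate. Returns (expected, p) where p is the Poisson
    probability of seeing `observed` or more failures if the assumed
    rate were correct."""
    expected = lambda_assumed * n_devices * years
    # P(X >= observed) for X ~ Poisson(expected)
    p_at_least = 1.0 - sum(
        math.exp(-expected) * expected ** k / math.factorial(k)
        for k in range(observed)
    )
    return expected, p_at_least

# Example: 40 valves, assumed lambda_DU of 0.02 pa, 3 years in service,
# 6 dangerous undetected failures found by proof testing:
expected, p = failure_budget(0.02, 40, 3.0, 6)
# expected is 2.4; p is below 5%, so the design assumption looks
# optimistic and the causes of the failures need to be analysed
```

A small p value indicates that the observed failure count is unlikely under the assumed rate, triggering the root cause analysis and compensating measures described above.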
Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
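Even when no failures have been observed, the operating history still bounds the plausible failure rate. The following sketch uses the standard chi-squared result, which reduces to a simple closed form for zero failures; the scenario figures are illustrative, not from the paper.

```python
import math

def lambda_upper_bound_no_failures(device_hours, confidence=0.70):
    """Upper confidence bound on the failure rate when zero failures
    have been observed over `device_hours` of cumulative operation.
    For zero failures the chi-squared bound reduces to -ln(1 - a) / T."""
    return -math.log(1.0 - confidence) / device_hours

# 50 devices monitored for 4 years (8760 h per year), no dangerous failures:
t_total = 50 * 4 * 8760.0
lam70 = lambda_upper_bound_no_failures(t_total, 0.70)
lam90 = lambda_upper_bound_no_failures(t_total, 0.90)
# lam90 is about 1.9 times lam70: with no recorded failures the
# uncertainty band remains wide
```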
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.
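The direct effect of out-of-service time can be illustrated with a one-line calculation (the figures are ours): a bypassed safety function contributes unavailability in proportion to the fraction of time it is bypassed.

```python
def pfd_out_of_service(hours_bypassed_per_year):
    """Average PFD contribution from time spent bypassed or under repair:
    the function provides no risk reduction for that fraction of the year."""
    return hours_bypassed_per_year / 8760.0

# 48 hours of bypass per year:
print(round(pfd_out_of_service(48.0), 4))  # prints 0.0055
# ~5.5e-3 on its own consumes more than half of the total PFD budget
# of a SIL 2 function (PFD < 1e-2)
```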
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries — Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA, Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.
Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.
2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.
3. Using risk-based inspection and condition-based maintenance techniques to:
- identify and then control conditions that induce early failures;
- actively prevent common-cause failures.
4. Detailed analysis of all failures to:
- monitor failure rates;
- determine how failure recurrence may be prevented, if failure rates need to be reduced.
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures: failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures: failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD of safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.
• Failures occurring within a population of devices are assumed to be independent events.
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFDs of the individual components.
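The exponential accumulation described above can be written out directly. A minimal sketch (function names and figures are ours):

```python
import math

def pfd_component(lambda_du_per_hr, t_hours):
    """Probability that a component carries an accumulated undetected
    dangerous failure at time t, under the constant-rate (exponential) model."""
    return 1.0 - math.exp(-lambda_du_per_hr * t_hours)

def pfd_avg_1oo1(lambda_du_per_hr, test_interval_hr):
    """Time-averaged PFD over a proof-test interval T; for small
    lambda*T this approaches the familiar lambda*T/2 approximation."""
    lt = lambda_du_per_hr * test_interval_hr
    return 1.0 - (1.0 - math.exp(-lt)) / lt

# lambda_DU of 0.02 per annum expressed per hour, annual proof test:
lam = 0.02 / 8760.0
T = 8760.0
print(round(pfd_avg_1oo1(lam, T), 4))  # prints 0.0099, close to 0.02/2 = 0.01
```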
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason, the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
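The dependence of the λ90/λ70 ratio on the number of recorded failures can be checked numerically. The sketch below uses the chi-squared relation for a time-terminated observation, with the Wilson-Hilferty approximation to keep it self-contained; a statistics library would give exact quantiles.

```python
import math
from statistics import NormalDist

def chi2_quantile(p, df):
    """Wilson-Hilferty approximation to the chi-squared quantile
    (accurate to within a few percent for the df values used here)."""
    z = NormalDist().inv_cdf(p)
    h = 2.0 / (9.0 * df)
    return df * (1.0 - h + z * math.sqrt(h)) ** 3

def lambda_est(n_failures, device_hours, confidence):
    """Failure-rate estimate at a one-sided confidence level:
    chi2(confidence, 2n + 2) / (2T) for a time-terminated observation."""
    return chi2_quantile(confidence, 2 * (n_failures + 1)) / (2.0 * device_hours)

# The ratio lambda_90 / lambda_70 depends only on the number of failures:
for n in (3, 10, 30):
    ratio = lambda_est(n, 1e6, 0.90) / lambda_est(n, 1e6, 0.70)
    print(n, round(ratio, 2))
# With 10 or more failures the ratio falls to roughly 1.2, consistent
# with the 20% figure quoted above
```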
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
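This style of treatment can be mimicked by fitting a lognormal distribution to a scatter of reported rates and quoting the 5% and 95% levels. The numbers below are invented for illustration, not taken from OREDA.

```python
import math

# Illustrative reported rates (failures per 10^6 hours) from six users:
rates = [0.5, 1.1, 2.3, 4.0, 8.9, 19.0]
logs = [math.log(r) for r in rates]
mu = sum(logs) / len(logs)
sigma = (sum((x - mu) ** 2 for x in logs) / (len(logs) - 1)) ** 0.5

z95 = 1.645  # standard normal 95% quantile
lower = math.exp(mu - z95 * sigma)  # 5% certainty level
upper = math.exp(mu + z95 * sigma)  # 95% certainty level
# The 90% band (lower, upper) spans well over an order of magnitude
```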
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 5
The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed, constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent, if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process: degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1-out-of-2) redundant architecture. A typical rate of undetected dangerous failures, λDU, for an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours)[8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.
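Substituting the example figures into the approximation above shows how strongly the common cause term dominates. A sketch (the β value of 10% is illustrative, in line with the SINTEF observation):

```python
def pfd_avg_1oo2(beta, lambda_du, T):
    """The two terms of the simplified 1oo2 approximation given above:
    the independent-failure term and the common cause term."""
    independent = (1.0 - beta) * (lambda_du * T) ** 2 / 3.0
    common_cause = beta * lambda_du * T / 2.0
    return independent, common_cause

# lambda_DU = 0.02 pa, annual proof test (T = 1 year), beta = 10%:
ind, ccf = pfd_avg_1oo2(0.10, 0.02, 1.0)
# ind is 1.2e-4 and ccf is 1.0e-3: the common cause term is almost
# an order of magnitude larger than the independent term
```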
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
Wrong but useful
The SINTEF Report A26922 includes a pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful'
The quote appears in the context of a discussion of different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report a calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision: the results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements
The example value of 002 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves For contactors or circuit breakers it is feasible to achieve failure rates lower than 001 pa
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany
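For a 1oo1 final element the usual approximation is PFDavg ≈ λDU·T/2, which shows why SIL 2 is marginal with the example figures. A brief sketch with those illustrative values:

```python
# 1oo1 average PFD approximation: PFDavg ≈ λDU * T / 2
lam_du = 0.02   # dangerous undetected failures per year (typical actuated valve)
T = 1.0         # proof-test interval in years

pfd = lam_du * T / 2
print(pfd)       # 0.01 -> RRF 100, right on the SIL 1 / SIL 2 boundary

# Halving the proof-test interval halves the PFD, moving it clearly into SIL 2
pfd_6m = lam_du * 0.5 / 2
print(pfd_6m)    # 0.005 -> RRF 200
```

This is why shortening T, reducing λDU, or moving to 1oo2 are the available levers for SIL 2 with valve final elements.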
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
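One way to implement such a target, sketched here as an illustrative assumption rather than the report's prescribed procedure, is to compare the observed failure count against the count expected from the design failure rate using a one-sided Poisson test:

```python
import math

def poisson_sf(k_obs: int, mu: float) -> float:
    """P(X >= k_obs) for a Poisson-distributed count with mean mu."""
    cdf = sum(math.exp(-mu) * mu**k / math.factorial(k) for k in range(k_obs))
    return 1.0 - cdf

def failures_exceed_target(observed: int, lam_target: float,
                           n_devices: int, years: float,
                           alpha: float = 0.05) -> bool:
    """Flag when the observed failure count is improbably high for the
    failure rate assumed in the design (one-sided test at level alpha)."""
    mu = lam_target * n_devices * years   # expected number of failures
    return poisson_sf(observed, mu) < alpha

# 50 valves over 2 years with a design assumption of 0.02 failures p.a.
# gives an expectation of about 2 failures.
print(failures_exceed_target(6, 0.02, 50, 2.0))  # True: analyse the causes
print(failures_exceed_target(2, 0.02, 50, 2.0))  # False: consistent with target
```

A flag here does not prove the design assumption wrong, but it is the trigger for root cause analysis and compensating measures.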
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate access for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of the sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both λDU and MTTR to be minimised in operation.
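The proportionality can be illustrated directly: whatever fraction of the operating period the function spends bypassed or out of service is added to its average PFD. A minimal sketch with illustrative values:

```python
# Unavailability added by time out of service: the fraction of the operating
# period during which the safety function cannot act on a demand.
HOURS_PER_YEAR = 8760.0

def pfd_out_of_service(hours_bypassed_per_year: float) -> float:
    return hours_bypassed_per_year / HOURS_PER_YEAR

# 48 hours of bypass per year adds about 0.0055 to the average PFD,
# already more than half of a nominal SIL 2 budget of 0.01.
print(pfd_out_of_service(48))
```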
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude; the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction: SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016.
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007.
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010.
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010.
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013.
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015.
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013.
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013.
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008.
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015.
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015.
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015.
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990.
PREVENT PREVENTABLE FAILURES
M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer
I&E Systems Pty Ltd
Perth, Western Australia
Abstract
IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.
Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.
Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.
This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.
Summary
Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature, and on failure rates being fixed and constant.
Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, the operating environment and the maintenance practices. It is clear that:
• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant
At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.
Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.
Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.
Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.
2. Designing the equipment to allow access to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented, if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD of safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
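Both steps can be checked numerically: for a constant rate λDU the probability of an accumulated undetected fault at time t after a proof test is 1 − e^(−λDU·t), whose average over the test interval reduces to the familiar λDU·T/2 approximation, and independent components combine by elementary probability. A brief sketch with illustrative values:

```python
import math

lam, T = 0.02, 1.0  # dangerous undetected rate (per year), proof-test interval

# Probability that a component is failed at time t since its last proof test
pfd_t = lambda t: 1.0 - math.exp(-lam * t)

# Average over the test interval: exact closed form vs the usual approximation
pfd_avg_exact = 1.0 - (1.0 - math.exp(-lam * T)) / (lam * T)
pfd_avg_approx = lam * T / 2
print(pfd_avg_exact, pfd_avg_approx)  # ~0.00993 vs 0.01

# Independent components in series (any one failing is dangerous):
# combine as 1 minus the product of availabilities; for small PFDs
# this is approximately the sum of the individual PFDs.
pfds = [0.01, 0.002, 0.0005]          # e.g. final element, logic solver, sensor
pfd_system = 1.0 - math.prod(1.0 - p for p in pfds)
print(pfd_system)                      # ~0.0125
```

The closeness of the exact average to λDU·T/2 is why the simple approximation is adequate for order-of-magnitude estimates.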
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to an increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself, or on the population size. The band of uncertainty becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
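One standard form of the χ² estimate (for a time-truncated observation period) takes λa = χ²a(2n + 2) / (2·Ttotal), where n is the number of recorded failures and Ttotal the cumulative operating time. The sketch below is an illustration of that formula, using the Wilson–Hilferty approximation to the χ² quantile so that it needs only the Python standard library; the formula choice and values are assumptions for the example, not taken from the paper:

```python
import math
from statistics import NormalDist

def chi2_ppf(p: float, k: int) -> float:
    """Wilson-Hilferty approximation to the chi-squared quantile with k dof."""
    z = NormalDist().inv_cdf(p)
    return k * (1 - 2 / (9 * k) + z * math.sqrt(2 / (9 * k))) ** 3

def rate_upper_bound(confidence: float, n_failures: int,
                     t_total_hours: float) -> float:
    """Failure rate estimated with the given confidence, for n failures
    observed over t_total_hours of cumulative device operating time."""
    return chi2_ppf(confidence, 2 * n_failures + 2) / (2 * t_total_hours)

# The ratio λ90/λ70 depends only on the number of failures, not on the rate:
# it falls from ~1.4 at 3 failures towards ~1.2 at 10 or more failures.
for n in (3, 10, 30):
    ratio = rate_upper_bound(0.90, n, 1e6) / rate_upper_bound(0.70, n, 1e6)
    print(n, round(ratio, 2))
```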
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard.'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.'
The mathematical analysis of failure probability is based on the concept of a 'random process', or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy.'
Failures that occur at a fixed, constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector, the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in the final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 p.a., then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 7
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
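The dominance of the common cause term can be checked numerically. A minimal sketch using the typical values quoted above; the β value of 5% is an assumed illustrative figure, not a value from the paper:

```python
# Terms of the 1oo2 approximation: PFDavg ≈ (1-β)(λDU·T)²/3 + β·λDU·T/2
lam_du = 0.02   # dangerous undetected failure rate, per year (typical valve)
T = 1.0         # proof test interval, years
beta = 0.05     # assumed common cause factor of 5% (illustrative)

independent_term = (1 - beta) * (lam_du * T) ** 2 / 3   # independent failures
common_cause_term = beta * lam_du * T / 2               # common cause failures

print(f"independent term:  {independent_term:.1e}")   # ~1.3e-04
print(f"common cause term: {common_cause_term:.1e}")  # ~5.0e-04
```

With these values the common cause term is roughly four times larger than the independent term, consistent with the observation that β must be below about 2% before the first term dominates.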
Wrong but useful
The SINTEF Report A26922 includes a pertinent quote from George E. P. Box:
'Essentially, all models are wrong, but some are useful'
The quote is in the context of a discussion of the different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.
The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
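A rough back-of-envelope check of why a 1oo1 final element is marginal for SIL 2, using the simple single-device approximation PFDavg ≈ λDU·T/2 and the typical values quoted earlier:

```python
lam_du = 0.02  # per year, typical actuated valve (value quoted in the text)
T = 1.0        # proof test interval, years

pfd_1oo1 = lam_du * T / 2   # rectangular approximation for a single device
rrf = 1 / pfd_1oo1          # risk reduction factor
print(pfd_1oo1, rrf)        # ≈ 0.01, RRF ≈ 100: the SIL 1 / SIL 2 boundary
```

An RRF of about 100 only just reaches the SIL 2 band (RRF 100 to 1000), which is why minimising λDU, shortening T, or moving to 1oo2 is needed for a comfortable SIL 2 margin, and why SIL 3 (RRF ≥ 1000) is not feasible with this architecture.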
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and they are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
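The comparison of recorded failures against the design target can be sketched as a simple one-sided Poisson check. This is an illustrative sketch, not the SINTEF procedure; the device count, observation period and counts below are assumed values:

```python
import math

lam_target = 0.02     # design λDU assumed in the PFD calculation, per device-year
device_years = 200.0  # e.g. 50 similar valves observed for 4 years (assumed)
observed = 9          # failures actually recorded (assumed)

expected = lam_target * device_years  # target: 4 expected failures

# Probability of seeing `observed` or more failures if the design rate held
p_tail = 1 - sum(math.exp(-expected) * expected**k / math.factorial(k)
                 for k in range(observed))

if p_tail < 0.05:
    print("Measured rate exceeds design target - analyse causes and "
          "consider compensating measures")
```

A low tail probability indicates that the observed failure count is unlikely under the assumed design rate, triggering root cause analysis rather than simply an increase in test frequency.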
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains, access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
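The proportionality between PFD and out-of-service time can be illustrated with the standard λDD·MTTR contribution for detected failures; the rates below are assumed for illustration:

```python
lam_dd = 0.1          # detected dangerous failures per year (assumed)
mttr_hours = 72.0     # mean time to restoration once detected (assumed)
hours_per_year = 8760.0

# Average unavailability while detected failures are awaiting repair
pfd_detected = lam_dd * (mttr_hours / hours_per_year)
print(f"{pfd_detected:.1e}")  # ~8.2e-04; halving MTTR halves this contribution
```

The same linear relationship applies to planned bypasses: every hour a function spends out of service adds directly to its average unavailability.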
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable, if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA, Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F. H. C. Marriott, 'A Dictionary of Statistical Terms', 5th edition, International Statistical Institute, Longman Scientific and Technical, 1990
purely random. Therefore, safety function reliability performance can be improved through four key strategies:
1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices
2. Designing the equipment to allow access to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage
3. Using risk-based inspection and condition-based maintenance techniques to:
- Identify and then control conditions that induce early failures
- Actively prevent common-cause failures
4. Detailed analysis of all failures to:
- Monitor failure rates
- Determine how failure recurrence may be prevented if failure rates need to be reduced
Functional safety depends on safety integrity
The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:
Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.
Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.
Calculation methods
IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].
Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:
• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates
• Failures occurring within a population of devices are assumed to be independent events
ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:
3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. Markovian approach).
If dangerous undetected failures occur independently, at a fixed rate λDU, within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
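The exponential accumulation between proof tests can be sketched for a single device, using the illustrative rate and interval quoted elsewhere in the paper; the averaging step shows where the familiar λDU·T/2 approximation comes from:

```python
import math

lam_du = 0.02  # dangerous undetected failure rate, per year
T = 1.0        # proof test interval, years

# Unavailability just before the proof test, and its average over the interval
pfd_at_test = 1 - math.exp(-lam_du * T)
n = 10000
pfd_avg = sum(1 - math.exp(-lam_du * T * i / n) for i in range(n)) / n

print(round(pfd_at_test, 4))  # ≈ λDU·T for small λDU·T
print(round(pfd_avg, 4))      # ≈ λDU·T/2, the usual approximation
```

Because λDU·T is small, the exponential is nearly linear over the interval, which is why the simple averaged forms used earlier in the paper are adequate.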
Hardware fault tolerance
The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.
Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.
IEC 61508 provides two strategies for minimising the required hardware fault tolerance:
• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')
• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')
A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.
IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.
Failure rate confidence level
If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself, or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
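A sketch of how the λ90/λ70 ratio depends only on the number of recorded failures. The χ² quantile is computed here with the Wilson–Hilferty approximation so that the example needs only the standard library; in practice an exact quantile from a statistics package would be used:

```python
import math
from statistics import NormalDist

def chi2_quantile(p, k):
    """Wilson-Hilferty approximation to the chi-squared p-quantile, k degrees of freedom."""
    z = NormalDist().inv_cdf(p)
    return k * (1 - 2 / (9 * k) + z * math.sqrt(2 / (9 * k))) ** 3

def lam_bound(confidence, n_failures, total_time):
    """Upper confidence bound on a failure rate after n_failures in total_time."""
    return chi2_quantile(confidence, 2 * (n_failures + 1)) / (2 * total_time)

# The ratio cancels total_time, so it depends only on the failure count
for n in (3, 10, 30):
    ratio = lam_bound(0.90, n, 1.0) / lam_bound(0.70, n, 1.0)
    print(f"{n} failures: lambda90/lambda70 = {ratio:.2f}")
```

The ratio shrinks as failures accumulate: with 10 or more recorded failures it falls towards 1.2, consistent with the statement that λ90 is then no more than roughly 20% above λ70.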
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard'
ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed, constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F. H. C. Marriott [13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented, to some extent, if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices
• No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided the common cause failures of redundant devices are modelled assuming a common cause factor β This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time
Common cause failures are never purely random because they are not independent events These failures are largely systematic and preventable
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors With fault tolerance in final elements the PFD is strongly dominated by common cause failures
As an example consider a safety function using actuated valves as final elements in a 1oo2 (ie 1 out of 2) redundant architecture A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 002 failures per annum (approximately 1 failure in 50 years or around 23 failures per 106 hours) [8] [11] [12] The average PFD of the valve subsystem may be approximated by
119875119875119875119875119875119875119860119860119860119860119860119860 asymp (1 minus 120573120573)(120582120582119863119863119863119863119879119879)2
3+ 120573120573
1205821205821198631198631198631198631198791198792
The last term in this equation represents the contribution of common cause failures It will dominate the PFD unless the following is true
120573120573 ≪ 120582120582119863119863119863119863119879119879
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 7
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.
For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
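The approximation above can be checked numerically. The sketch below uses illustrative values consistent with the text (λDU = 0.02 pa, T = 1 year) and assumes β = 12%, a value within the typical range reported by SINTEF; it simply evaluates the two terms of the 1oo2 approximation to show how thoroughly the common cause term dominates.

```python
def pfd_avg_1oo2(lam_du: float, t: float, beta: float) -> float:
    """Approximate average PFD of a 1oo2 subsystem:
    independent-failure term plus common cause term."""
    independent = (1 - beta) * (lam_du * t) ** 2 / 3
    common_cause = beta * lam_du * t / 2
    return independent + common_cause

lam_du, t, beta = 0.02, 1.0, 0.12   # illustrative values, not recommendations
independent = (1 - beta) * (lam_du * t) ** 2 / 3   # ~1.2e-4
common_cause = beta * lam_du * t / 2               # 1.2e-3, about 10x larger
print(f"independent term:  {independent:.2e}")
print(f"common cause term: {common_cause:.2e}")
print(f"PFDavg:            {pfd_avg_1oo2(lam_du, t, beta):.2e}")
```

With these values the common cause term is roughly ten times the independent term, which is the point of the dominance condition β ≪ λDU T.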
'Wrong but useful'
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:
'Essentially, all models are wrong, but some are useful.'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.
The published failure rates are useful as an indicator of failure rates that may be achieved in practice.
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
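The order-of-magnitude categorisation described above can be sketched as a simple band lookup. The helper below is hypothetical; the PFD bands are the standard low-demand SIL bands from IEC 61511 (SIL n covers 10⁻⁽ⁿ⁺¹⁾ ≤ PFDavg < 10⁻ⁿ).

```python
def sil_band(pfd_avg: float) -> int:
    """Return the low-demand SIL band a calculated PFDavg falls into.
    Returns 0 if the PFD is too high even for SIL 1."""
    bands = {1: (1e-2, 1e-1), 2: (1e-3, 1e-2),
             3: (1e-4, 1e-3), 4: (1e-5, 1e-4)}
    for sil, (lo, hi) in bands.items():
        if lo <= pfd_avg < hi:
            return sil
    return 0

# A calculated PFDavg of 1.3e-3 (RRF ~ 770) is enough to place the
# function in SIL 2; anything more precise than the band is spurious.
print(sil_band(1.3e-3))  # 2
print(sil_band(6.3e-4))  # 3
```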
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and they are all under the control of the designers and of the operations and maintenance team.
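These architecture trade-offs can be sketched with the approximations used earlier. The values are illustrative only (λDU = 0.02 pa, T = 1 year), and the β values are assumptions chosen to show the effect of improving independence, not recommendations.

```python
def pfd_1oo1(lam_du: float, t: float) -> float:
    """Average PFD of a single final element, undetected failures only."""
    return lam_du * t / 2

def pfd_1oo2(lam_du: float, t: float, beta: float) -> float:
    """Average PFD of a 1oo2 pair: independent plus common cause terms."""
    return (1 - beta) * (lam_du * t) ** 2 / 3 + beta * lam_du * t / 2

lam_du, t = 0.02, 1.0
# 1oo1: RRF ~ 100, right at the bottom of the SIL 2 band (marginal)
print(f"1oo1:           RRF ~ {1 / pfd_1oo1(lam_du, t):.0f}")
# 1oo2 with beta = 12%: RRF ~ 760, short of the SIL 3 minimum of 1000
print(f"1oo2, beta=12%: RRF ~ {1 / pfd_1oo2(lam_du, t, 0.12):.0f}")
# 1oo2 with beta reduced to 5%: RRF ~ 1600, SIL 3 becomes feasible
print(f"1oo2, beta=5%:  RRF ~ {1 / pfd_1oo2(lam_du, t, 0.05):.0f}")
```

The sketch shows why reducing β is the main lever: halving β roughly doubles the achievable risk reduction once the common cause term dominates.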
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
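The basic comparison in that follow-up process can be sketched as below. The function name and the example counts are hypothetical; real follow-up along the lines of SINTEF A8788 also involves analysing the causes of each failure, not just counting them.

```python
def check_failure_rate(observed_failures: int, n_devices: int,
                       years: float, target_rate_pa: float):
    """Compare the observed failure rate against the design benchmark.
    Returns (observed rate, expected failure count, benchmark exceeded?)."""
    device_years = n_devices * years
    expected_failures = target_rate_pa * device_years
    observed_rate = observed_failures / device_years
    return observed_rate, expected_failures, observed_failures > expected_failures

# e.g. 9 dangerous undetected failures across 50 valves over 4 years,
# against the 0.02 pa benchmark assumed in the PFD calculation:
rate, expected, exceeded = check_failure_rate(9, 50, 4.0, 0.02)
print(f"observed {rate:.3f} pa vs target 0.020 pa "
      f"(expected ~{expected:.0f} failures); investigate causes: {exceeded}")
```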
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
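The proportionality noted above can be illustrated directly: the average PFD contribution of out-of-service time is simply the fraction of operating time spent bypassed or under repair. This is a simplified sketch that treats the downtime contribution in isolation from the other PFD terms.

```python
HOURS_PER_YEAR = 8760.0

def pfd_from_downtime(hours_out_of_service_per_year: float) -> float:
    """Average PFD contribution from time the safety function is
    bypassed or out of service (unavailability fraction)."""
    return hours_out_of_service_per_year / HOURS_PER_YEAR

# 8 hours of bypass per year contributes roughly 9e-4 to the PFD,
# which on its own would consume most of a SIL 3 budget (PFDavg < 1e-3).
print(f"{pfd_from_downtime(8.0):.1e}")
```

This is why accessibility for testing and maintenance matters: a function that cannot be restored quickly carries a downtime contribution that no amount of redundancy removes.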
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied, with sufficient effectiveness, to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated, and if inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail, and the effectiveness of the processes, techniques and measures, must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
bull The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime
bull The rates of systematic failures vary widely from user to user depending on the effectiveness of the quality management practices
bull No clear distinction is made between systematic failure and random failure in the reported failures
Common cause failure
Where hardware fault tolerance is provided the common cause failures of redundant devices are modelled assuming a common cause factor β This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time
Common cause failures are never purely random because they are not independent events These failures are largely systematic and preventable
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors With fault tolerance in final elements the PFD is strongly dominated by common cause failures
As an example consider a safety function using actuated valves as final elements in a 1oo2 (ie 1 out of 2) redundant architecture A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 002 failures per annum (approximately 1 failure in 50 years or around 23 failures per 106 hours) [8] [11] [12] The average PFD of the valve subsystem may be approximated by
119875119875119875119875119875119875119860119860119860119860119860119860 asymp (1 minus 120573120573)(120582120582119863119863119863119863119879119879)2
3+ 120573120573
1205821205821198631198631198631198631198791198792
The last term in this equation represents the contribution of common cause failures It will dominate the PFD unless the following is true
120573120573 ≪ 120582120582119863119863119863119863119879119879
If the test interval T is 1 year and λDU = 002 pa then the common cause failure term will be greater than the first term unless β lt 2 and such a low value is difficult to achieve
For the common cause failure term to be negligible the test interval T would usually have to be significantly longer than 1 year andor the β-factor would have to be much less than 2
The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10 Typical values achieved are in the range 12 to 15
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 7
Clearly the PFD for undetected failures is directly proportional to β λDU and T
For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George EP Box
lsquoEssentially all models are wrong but some are usefulrsquo
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β Similarly we can conclude that although the models used for calculating PFD are wrong they are still useful
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant and the band of uncertainty spans more than one order of magnitude The statistics suggest that the failure rates of sensors are also not constant
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision because the failure rates are not constant It is misleading to report calculated PFD with more than one significant figure of precision
Application of Markov models Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision The results are much more precise but no more accurate
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture) using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
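The marginality can be checked with the usual low-demand approximation for a single proof-tested element, PFDavg ≈ λDU·T/2, using the example figures from this paper (a minimal sketch; the helper name is illustrative):

```python
def pfd_1oo1(lambda_du_per_year, test_interval_years):
    # Average PFD of a single element that is only proof tested at interval T
    return lambda_du_per_year * test_interval_years / 2.0

# Typical actuated valve: lambda_DU = 0.02 pa, annual proof test
print(pfd_1oo1(0.02, 1.0))  # 0.01 -> RRF 100, the very bottom of the SIL 2 range
# Halving the proof-test interval doubles the margin
print(pfd_1oo1(0.02, 0.5))  # 0.005 -> RRF 200
```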
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and are all under the control of the designers and the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
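One way to implement such a benchmark is to compare the observed failure count in a population against the count predicted by the design failure rate. The Poisson alarm limit used here is an illustrative choice, not a prescription from the SINTEF report:

```python
import math

def expected_failures(lambda_per_year, n_devices, years):
    # Predicted failure count if the design failure rate holds in operation
    return lambda_per_year * n_devices * years

def poisson_upper_quantile(mean, confidence=0.95):
    # Smallest k with P(X <= k) >= confidence for X ~ Poisson(mean)
    k = 0
    term = math.exp(-mean)
    cum = term
    while cum < confidence:
        k += 1
        term *= mean / k
        cum += term
    return k

# 50 valves at the assumed 0.02 failures/year, reviewed after 2 years
mu = expected_failures(0.02, 50, 2)       # 2 failures expected
limit = poisson_upper_quantile(mu)        # alarm limit at 95% confidence
print(mu, limit)  # observed counts above `limit` warrant cause analysis
```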
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 9
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
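The proportionality between time out of service and PFD can be made concrete with simple bookkeeping (a sketch, not a formula from the standards): while a function is bypassed it provides no risk reduction, so the bypassed fraction of the year adds directly to the average PFD.

```python
HOURS_PER_YEAR = 8760.0

def pfd_contribution_from_downtime(hours_out_of_service_per_year):
    # Fraction of the year during which the function provides no protection
    return hours_out_of_service_per_year / HOURS_PER_YEAR

# 48 hours of bypass per year consumes more than half of a
# SIL 2 budget of PFD = 0.01
print(round(pfd_contribution_from_downtime(48), 4))  # prints: 0.0055
```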
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the 'random' failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable, if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself, or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
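This behaviour can be reproduced with the standard chi-squared confidence bound on a constant failure rate estimated from a time-terminated observation. The 2n + 2 degrees of freedom and the Wilson-Hilferty approximation used below are standard reliability practice, assumed here rather than taken from this paper:

```python
from statistics import NormalDist

def chi2_quantile(p, k):
    # Wilson-Hilferty approximation to the chi-squared quantile (k d.o.f.)
    z = NormalDist().inv_cdf(p)
    return k * (1.0 - 2.0 / (9.0 * k) + z * (2.0 / (9.0 * k)) ** 0.5) ** 3

def lambda_upper(n_failures, total_device_hours, confidence):
    # One-sided upper confidence limit on a constant failure rate,
    # time-terminated observation: chi2(confidence, 2n + 2) / (2 * T)
    return chi2_quantile(confidence, 2 * n_failures + 2) / (2.0 * total_device_hours)

# The ratio of the 90% to the 70% bound depends only on the failure count;
# the band narrows as failures accumulate
for n in (3, 10, 30):
    ratio = lambda_upper(n, 1e6, 0.90) / lambda_upper(n, 1e6, 0.70)
    print(n, round(ratio, 2))
```

The total device-hours cancel out of the ratio, which is exactly the point made above: the width of the band is set by the number of failures, not by the rate or the population.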
Failure rate sources
The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.
The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.
The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
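As an illustration of how wide such a band can be, a lognormal spread with a plausible spread parameter already spans about two orders of magnitude between the 5% and 95% levels. The median rate and spread parameter below are made-up illustrative values, not figures from the OREDA tables:

```python
import math
from statistics import NormalDist

def lognormal_band(median_rate, sigma_ln):
    # 5%..95% band of a lognormal distribution: median * exp(+/- 1.645 * sigma)
    z95 = NormalDist().inv_cdf(0.95)
    spread = math.exp(z95 * sigma_ln)
    return median_rate / spread, median_rate * spread

# Hypothetical valve failure rate: median 2.3 per 10^6 hours, sigma_ln = 1.4
low, high = lognormal_band(2.3e-6, 1.4)
print(round(high / low))  # the band spans roughly a factor of 100
```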
Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.
Variability in failure rates
It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.
One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.
Random failure and systematic failure
It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':
'Made, done, or happening without method or conscious decision; haphazard.'
ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:
'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:
'failure rates arising from random hardware failures can be predicted with reasonable accuracy'
Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:
'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'
The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.
By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:
'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'
The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.
Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent, if the degradation is monitored and corrective maintenance can be completed when it is needed.
Degradation of mechanical components is not a purely random process. Degradation can be monitored.
While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random, because they are not independent events. These failures are largely systematic, and preventable.
Factors that dominate PFD
In the process sector, the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in the final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1-out-of-2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours)[8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·(λDU·T)/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:
β ≪ λDU·T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.
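The dominance of the common cause term can be checked numerically from the approximation above, using the example figures (λDU = 0.02 pa, T = 1 year) and an assumed β of 10%:

```python
def pfd_1oo2_terms(lambda_du, T, beta):
    # Independent-failure term and common cause term of the 1oo2 approximation
    independent = (1.0 - beta) * (lambda_du * T) ** 2 / 3.0
    common_cause = beta * lambda_du * T / 2.0
    return independent, common_cause

ind, ccf = pfd_1oo2_terms(0.02, 1.0, 0.10)
print(round(ind, 6), round(ccf, 6))  # 0.00012 vs 0.001
print(round(ccf / ind, 1))           # the CCF term dominates by a factor of ~8
```

Even with a well-managed β of 10%, the redundancy only improves the PFD by roughly an order of magnitude over a single valve; the common cause term sets the floor.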
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures The standards describe processes techniques methods and measures to prevent avoid and detect systematic faults and resulting failures
Activities techniques measures and procedures can be selected to detect or to prevent faults and failures IEC 61511-1 sect623 requires planning of activities criteria techniques measures and procedures throughout the safety system lifecycle The rationale needs to be recorded
Preventing lsquorandomrsquo failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random Most failures that are usually classed as random are actually preventable to some extent This includes all common cause failures
Conclusions Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates The precision in the estimates is limited by the uncertainty in the failure rates OREDA statistics clearly show that the failure rates vary over at least an order of magnitude The rates are not fixed and constant The reported failure rates serve as an indication of the failure rates that can be feasibly achieved with established practices for operation inspection maintenance and renewal
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures Most failures fall somewhere in the middle between the two extremes of purely random and purely deterministic Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure In practice failures are not completely preventable because access to the equipment and resources are limited Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant These assumptions enable the PFD and RRF to be estimated within an order of magnitude given reasonably effective quality control in design manufacture operation and maintenance The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 11
Strategies for improving risk reduction
1 The first priority in functional safety is to eliminate prevent avoid or control systematic failures throughout the entire system lifecycle Failures are minimised by applying conventional quality management and project management practices This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function It includes the achievement of an appropriate level of systematic capability
The level of attention to detail and the effectiveness of the processes techniques and measures must be in proportion to the target level of risk reduction SIL 3 functions need much stricter quality control than SIL 1 functions
2 The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics accessibility inspection testing maintenance and renewal
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction The requirements also depend on the cost of downtime To achieve SIL 3 safety functions will always need to be designed to enable ready access for inspection testing and maintenance
The planning for inspection and testing should be in proportion to the target level of risk reduction Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment
3 Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions Common cause failures dominate the PFD in all voted architectures of sensors and of final elements
4 In the operations phase measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design inspection testing maintenance and renewal The measured failure rates should be compared with the rates assumed in the PFD calculations
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future
If the measured failure rates are higher than the target benchmark then the reasons need to be understood and remedial action taken Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 12
References
[1] lsquoFunctional safety ndash Safety instrumented systems for the process industry sector ndash Part 1 Framework definitions system hardware and application programming requirementsrsquo IEC 61511-12016
[2] lsquoEquipment reliability testing Part 6 Tests for the validity and estimation of the constant failure rate and constant failure intensityrsquo IEC 60605-62007
[3] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems ndash Part 4 Definitions and abbreviationsrsquo IEC 61508-42010
[4] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems - Guidelines on the application of IEC 61508-2 and IEC 61508-3rsquo IEC 61508-62010
[5] lsquoPetroleum petrochemical and natural gas industries mdash Reliability modelling and calculation of safety systemsrsquo ISOTR 12489
[6] lsquoSafety Integrity Level (SIL) Verification of Safety Instrumented Functionsrsquo ISA-TR840002-2015
[7] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Method Handbookrsquo SINTEF Trondheim Norway Report A24442 2013
[8] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Data Handbookrsquo SINTEF Trondheim Norway Report A24443 2013
[9] S Hauge and M A Lundteigen lsquoGuidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phasersquo SINTEF Trondheim Norway Report A8788 2008
[10] S Hauge et al lsquoCommon Cause Failures in Safety Instrumented Systemsrsquo SINTEF Trondheim Norway Report A26922 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook Vol 1 6th ed SINTEF Technology and Society Department of Safety Research Trondheim Norway 2015
[12] Safety Equipment Reliability Handbook 4th ed exidacom LLC Sellersville PA 2015
[13] FHC Marriott lsquoA Dictionary of Statistical Termsrsquo 5th Edition International Statistical Institute Longman Scientific and Technical 1990
Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany
The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:

'…failure rates arising from random hardware failures can be predicted with reasonable accuracy.'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and so are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.
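The behaviour of a purely random process can be illustrated with a short simulation. This is a minimal sketch: the failure rate and sample size are arbitrary illustrative values, not data from the paper.

```python
import random

random.seed(1)

# A purely random ("white noise") failure process has mutually independent,
# exponentially distributed times between failures and a constant hazard rate.
LAMBDA = 0.02    # assumed failure rate, failures per year (illustrative)
N = 10_000       # number of simulated failure intervals

intervals = [random.expovariate(LAMBDA) for _ in range(N)]
mtbf = sum(intervals) / N        # sample mean time between failures
rate_estimate = 1.0 / mtbf       # converges towards LAMBDA for large N

print(f"estimated rate: {rate_estimate:.4f} per year")
```

Degradation-driven failures break the mutual-independence assumption, so their observed rate drifts with equipment age and maintenance history rather than converging to a constant.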
The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled by assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause, which will affect all of the devices at about the same time.

Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector, the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1-out-of-2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8][11][12]. The average PFD of the valve subsystem may be approximated by:
PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2
The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that, in practice, the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.
Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
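The approximation above, together with the detected-failure contribution, can be sketched in Python. The β = 10%, λDD and MTTR values below are illustrative assumptions, not figures from the references:

```python
def pfd_1oo2_undetected(lambda_du: float, t_years: float, beta: float) -> float:
    """Average PFD of a 1oo2 subsystem from undetected dangerous failures:
    (1 - beta)*(lambda_DU*T)^2/3 + beta*lambda_DU*T/2."""
    independent = (1 - beta) * (lambda_du * t_years) ** 2 / 3
    common_cause = beta * lambda_du * t_years / 2
    return independent + common_cause

def pfd_detected(lambda_dd: float, mttr_hours: float) -> float:
    """PFD contribution of detected failures: lambda_DD * MTTR (MTTR in years)."""
    return lambda_dd * mttr_hours / 8760

# lambda_DU = 0.02 pa, T = 1 year, assumed beta = 10%:
pfd_u = pfd_1oo2_undetected(0.02, 1.0, 0.10)
# independent term = 1.2e-4; common cause term = 1.0e-3, so the
# beta term dominates the total of 1.12e-3, as argued in the text.
```

With these numbers the common cause term is roughly ten times the independent term, which is why β dominates the achievable risk reduction.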
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant; the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.

The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
Setting feasible targets
During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
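These feasibility arguments can be checked with a short sketch that maps an average PFD onto the demand-mode SIL bands of IEC 61511. The helper function is an illustrative assumption of this paper's order-of-magnitude argument, not a procedure from the standard:

```python
def sil_band(pfd: float) -> int:
    """Return the demand-mode SIL for an average PFD, or 0 if below SIL 1."""
    if 1e-5 <= pfd < 1e-4:
        return 4
    if 1e-4 <= pfd < 1e-3:
        return 3
    if 1e-3 <= pfd < 1e-2:
        return 2
    if 1e-2 <= pfd < 1e-1:
        return 1
    return 0

# 1oo1 valve with lambda_DU = 0.02 pa and T = 1 year:
# PFD ~ lambda_DU * T / 2 = 0.01, i.e. right on the SIL 1 / SIL 2 boundary,
# which is the "marginal" 1oo1 case described in the text.
pfd_1oo1 = 0.02 * 1.0 / 2
rrf = 1 / pfd_1oo1    # risk reduction factor, approximately 100
```

Halving λDU or T moves the same architecture comfortably into the SIL 2 band, which is why the operational targets for these parameters matter so much.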
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and of the operations and maintenance team.
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during the operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased testing frequency alone.
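The target-setting approach can be sketched as follows. The population size, observation period and the simple square-root allowance are hypothetical choices for illustration, not prescriptions from the SINTEF report:

```python
import math

def expected_failures(rate_pa: float, n_devices: int, years: float) -> float:
    """Target number of failures for a device population over an interval,
    based on the failure rate assumed in the design."""
    return rate_pa * n_devices * years

def exceeds_target(observed: int, expected: float, margin: float = 2.0) -> bool:
    """Flag follow-up analysis when the observed count exceeds the target by
    more than a simple Poisson-style allowance: expected + margin*sqrt(expected).
    (The allowance rule is an assumption of this sketch.)"""
    return observed > expected + margin * math.sqrt(expected)

# Hypothetical fleet: 50 devices, assumed lambda_DU = 0.02 pa, 4 years in service.
target = expected_failures(0.02, 50, 4.0)   # 4 failures expected
needs_analysis = exceeds_target(9, target)  # 9 observed -> analyse the causes
```

If the flag is raised, the causes of the excess failures need to be analysed and compensating measures considered, as described above.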
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
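As an illustration, a leading indicator built on one such monitored parameter might look like this. The parameter (valve stroke time), the alarm limit and the readings are all hypothetical:

```python
BASELINE_STROKE_S = 2.0   # assumed stroke time when the valve was new
ALARM_FACTOR = 1.5        # assumed limit: investigate when 50% slower

def incipient_failure(stroke_times_s: list) -> bool:
    """Flag deterioration if the latest measured stroke time breaches the limit."""
    return stroke_times_s[-1] > ALARM_FACTOR * BASELINE_STROKE_S

# A slowing stroke-time trend is measurable long before the valve fails to close.
trend = [2.0, 2.1, 2.3, 2.6, 3.2]
flag = incipient_failure(trend)   # True -> schedule corrective maintenance
```

The point of the sketch is that the indicator leads the failure: maintenance is triggered by measured condition, not by waiting for a demand to reveal a dangerous undetected fault.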
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied, with sufficient effectiveness, to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly, at fixed constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude; the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable, if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail, and the effectiveness of the processes, techniques and measures, must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary, to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016.
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007.
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010.
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010.
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489.
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015.
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013.
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013.
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008.
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015.
[11] 'OREDA: Offshore and Onshore Reliability Data Handbook', Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015.
[12] 'Safety Equipment Reliability Handbook', 4th ed., exida.com LLC, Sellersville, PA, 2015.
[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990.
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 6
The reasons for the wide variability in reported failure rates seem to be clear:
• Only a small proportion of failures occur at a fixed rate that cannot be changed.
• The failure rates of mechanical components vary widely depending on service conditions and on the effectiveness of maintenance.
• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.
• The rates of systematic failures vary widely from user to user depending on the effectiveness of quality management practices.
• No clear distinction is made between systematic failure and random failure in the reported failures.
Common cause failure
Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.
Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.
Factors that dominate PFD
In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.
As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by
PFDavg ≈ (1 − β)(λDU T)²/3 + β λDU T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU T
If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2 %, and such a low value is difficult to achieve.
For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2 %.
The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10 %. Typical values achieved are in the range 12 % to 15 %.
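The dominance of the common cause term can be checked numerically. This is a minimal sketch of the 1oo2 approximation quoted above, using the example values (λDU = 0.02 pa, T = 1 year) and an assumed β of 10 %:

```python
# Average PFD of a 1oo2 final-element subsystem, split into its two terms:
# an independent-failure term and a common cause term (simple beta-factor model).
def pfd_1oo2(lam_du, T, beta):
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    return independent, common_cause

ind, ccf = pfd_1oo2(lam_du=0.02, T=1.0, beta=0.10)  # rates per annum, T in years
print(f"independent term:  {ind:.1e}")   # 1.2e-04
print(f"common cause term: {ccf:.1e}")   # 1.0e-03 - dominates by roughly 10x
```

With β = 10 % the common cause term contributes almost 90 % of the total PFD, which is why reducing β matters more than adding further redundancy.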
Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany
Clearly the PFD for undetected failures is directly proportional to β λDU and T
For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)
Wrong but useful
The SINTEF Report A26922 includes the pertinent quote from George E. P. Box:
'Essentially, all models are wrong, but some are useful.'
The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.
The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant and the band of uncertainty spans more than one order of magnitude The statistics suggest that the failure rates of sensors are also not constant
The published failure rates are useful as an indicator of failure rates that may be achieved in practice
The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.
Application of Markov models Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision The results are much more precise but no more accurate
The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation
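Since the estimate is only good to an order of magnitude, the useful outputs are the approximate risk reduction factor and the SIL band. A small sketch of that categorisation, using the demand-mode PFD bands from IEC 61511:

```python
import math

def sil_band(pfd_avg):
    """SIL band supported by a demand-mode PFDavg (SIL 1: 1e-2 to 1e-1,
    SIL 2: 1e-3 to 1e-2, SIL 3: 1e-4 to 1e-3, SIL 4: 1e-5 to 1e-4)."""
    if not 1e-5 <= pfd_avg < 1e-1:
        raise ValueError("outside the SIL 1 to SIL 4 demand-mode range")
    return min(4, math.floor(-math.log10(pfd_avg)))

pfd = 1.12e-3                # a typically over-precise calculated value
print(f"RRF = {1 / pfd:.0e}, SIL {sil_band(pfd)}")   # RRF = 9e+02, SIL 2
```

Reporting the result as 'RRF about 900, SIL 2 feasible' carries all the information the calculation can honestly support.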
Setting feasible targets
During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction
In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements
The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture) using either a valve or contactor.
It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval
between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.
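The interplay between these targets can be illustrated with a quick sweep. The formulas follow the earlier 1oo2 approximation and the simple 1oo1 average; the particular values of T and β are illustrative assumptions only:

```python
# Illustrative comparison of 1oo1 and 1oo2 final-element PFD for a few
# candidate test intervals T (years) and beta values (assumed, not prescribed).
def pfd_1oo1(lam_du, T):
    return lam_du * T / 2

def pfd_1oo2(lam_du, T, beta):
    return (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2

lam_du = 0.02  # per annum, the typical actuated-valve rate quoted above
for T in (0.5, 1.0):
    for beta in (0.05, 0.10):
        print(f"T={T:.1f} a, beta={beta:.0%}: "
              f"1oo1={pfd_1oo1(lam_du, T):.0e}, "
              f"1oo2={pfd_1oo2(lam_du, T, beta):.0e}")
```

With these assumptions, 1oo1 at T = 1 a gives about 1×10⁻², marginal for SIL 2; the 1oo2 cases range from about 3×10⁻⁴ to 1.1×10⁻³, showing how quickly the margin for SIL 3 disappears as β or T grows.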
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment
Evidence of prior use
Systematic capability is not limited to design Evidence is also needed to show that operation inspection testing and maintenance of the systems will be effective in achieving the target failure performance
Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
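One way to operationalise this comparison is sketched below. The Poisson screening statistic is an assumption of this sketch, not a method taken from the SINTEF guideline, and the device counts and exposure are invented for illustration:

```python
import math

# Compare failures observed in service with the number expected from the
# design failure rate. The Poisson tail probability is a hypothetical
# screening statistic for flagging an excess worth investigating.
def expected_failures(lam_target, n_devices, years):
    return lam_target * n_devices * years

def poisson_tail(mu, k):
    """P(X >= k) for X ~ Poisson(mu): chance of seeing k or more failures."""
    return 1.0 - sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))

mu = expected_failures(lam_target=0.02, n_devices=50, years=3)  # 3.0 expected
observed = 8
p = poisson_tail(mu, observed)   # about 0.012
if p < 0.05:
    print(f"{observed} failures vs {mu:.1f} expected (p={p:.3f}): analyse causes")
```

A count well above the benchmark triggers root cause analysis and compensating measures, exactly as the guideline describes; the threshold of 0.05 is an arbitrary screening choice.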
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment If the hardware failure rates are relatively low and the population of devices is small there may be too few failures to allow a rate to be measured
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
Designing for testability and maintainability
A common problem for plant operators is that the plants are designed to minimise initial construction cost The equipment is not designed to facilitate accessibility for testing or for maintenance
For example on LNG compression trains access to the equipment is often constrained Some critical final elements can only be taken out of service at intervals of 5 years or more Even if the designers have provided facilities to enable on-line condition monitoring and testing the opportunities for corrective maintenance are severely constrained by the need to maintain production Deteriorating equipment must remain in service until the next planned shutdown
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
Designing the system to facilitate inspection testing and maintenance enables both the λDU and the MTTR to be minimised in operation
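The out-of-service contribution can be budgeted directly, since the added unavailability is simply the fraction of time the function is bypassed. A rough planning sketch, where the 24 h per year figure is an arbitrary assumption:

```python
# Unavailability added to a safety function by time spent bypassed or
# out of service: the fraction of the year it cannot respond to a demand.
def bypass_unavailability(hours_out_per_year):
    return hours_out_per_year / 8760.0  # 8760 h in a year

# e.g. 24 h of bypass per year adds about 2.7e-3 to the PFD budget -
# enough on its own to consume a SIL 2 target
print(f"{bypass_unavailability(24):.1e}")   # 2.7e-03
```

This is why minimising MTTR through accessible design is not a refinement but a first-order factor in the achieved PFD.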
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing lsquorandomrsquo failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random Most failures that are usually classed as random are actually preventable to some extent This includes all common cause failures
Conclusions
Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future
If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries — Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F. H. C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction The requirements also depend on the cost of downtime To achieve SIL 3 safety functions will always need to be designed to enable ready access for inspection testing and maintenance
The planning for inspection and testing should be in proportion to the target level of risk reduction Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment
3 Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions Common cause failures dominate the PFD in all voted architectures of sensors and of final elements
4 In the operations phase measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design inspection testing maintenance and renewal The measured failure rates should be compared with the rates assumed in the PFD calculations
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future
If the measured failure rates are higher than the target benchmark then the reasons need to be understood and remedial action taken Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 12
References
[1] lsquoFunctional safety ndash Safety instrumented systems for the process industry sector ndash Part 1 Framework definitions system hardware and application programming requirementsrsquo IEC 61511-12016
[2] lsquoEquipment reliability testing Part 6 Tests for the validity and estimation of the constant failure rate and constant failure intensityrsquo IEC 60605-62007
[3] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems ndash Part 4 Definitions and abbreviationsrsquo IEC 61508-42010
[4] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems - Guidelines on the application of IEC 61508-2 and IEC 61508-3rsquo IEC 61508-62010
[5] lsquoPetroleum petrochemical and natural gas industries mdash Reliability modelling and calculation of safety systemsrsquo ISOTR 12489
[6] lsquoSafety Integrity Level (SIL) Verification of Safety Instrumented Functionsrsquo ISA-TR840002-2015
[7] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Method Handbookrsquo SINTEF Trondheim Norway Report A24442 2013
[8] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Data Handbookrsquo SINTEF Trondheim Norway Report A24443 2013
[9] S Hauge and M A Lundteigen lsquoGuidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phasersquo SINTEF Trondheim Norway Report A8788 2008
[10] S Hauge et al lsquoCommon Cause Failures in Safety Instrumented Systemsrsquo SINTEF Trondheim Norway Report A26922 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook Vol 1 6th ed SINTEF Technology and Society Department of Safety Research Trondheim Norway 2015
[12] Safety Equipment Reliability Handbook 4th ed exidacom LLC Sellersville PA 2015
[13] FHC Marriott lsquoA Dictionary of Statistical Termsrsquo 5th Edition International Statistical Institute Longman Scientific and Technical 1990
Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany
between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.
This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
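These relationships can be sketched numerically. The following is a minimal illustration using simplified IEC 61508-6-style average-PFD formulas for 1oo1 and 1oo2 architectures; the λDU, T and β values are illustrative assumptions, not data from this paper.

```python
# Simplified average-PFD formulas (IEC 61508-6 style), ignoring MTTR and
# diagnostic coverage for clarity. All numeric values are illustrative.

def pfd_1oo1(lam_du: float, t_proof: float) -> float:
    """Average PFD of a single channel: lambda_DU * T / 2."""
    return lam_du * t_proof / 2.0

def pfd_1oo2(lam_du: float, t_proof: float, beta: float) -> float:
    """Average PFD of a 1oo2 pair: independent term plus common cause term."""
    lam_ind = (1.0 - beta) * lam_du
    independent = (lam_ind * t_proof) ** 2 / 3.0
    common_cause = beta * lam_du * t_proof / 2.0
    return independent + common_cause

lam_du = 2e-6       # dangerous undetected failure rate per hour (assumed)
t_proof = 8760.0    # annual proof test interval in hours (assumed)

print(f"1oo1 PFD = {pfd_1oo1(lam_du, t_proof):.2e}")
for beta in (0.10, 0.05, 0.02):
    pfd = pfd_1oo2(lam_du, t_proof, beta)
    print(f"1oo2 PFD (beta={beta:.2f}) = {pfd:.2e}, RRF = {1/pfd:.0f}")
```

With these values the β term dominates the 1oo2 result, which is why reducing β (independence) and λDU, or shortening T, are the levers the text describes.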
Independent certification of systematic capability
The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.
Independent certification provides good evidence of systematic capability in the design and construction of the equipment.
Evidence of prior use
Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.
Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.
Measuring performance against benchmarks
IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.
The SINTEF report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
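A minimal sketch of such a follow-up check, assuming failures in a device population follow a Poisson process at the target rate; the population size, operating hours and failure counts below are invented for illustration.

```python
import math

def poisson_sf(k: int, mu: float) -> float:
    """P(N > k) for N ~ Poisson(mu), from the exact pmf sum."""
    cdf = sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k + 1))
    return 1.0 - cdf

def failure_rate_alarm(observed: int, lam_target: float,
                       unit_hours: float, alpha: float = 0.05) -> bool:
    """Flag if the observed count is implausibly high for the target rate.

    lam_target: failure rate per hour assumed in the PFD calculation
    unit_hours: cumulative operating hours across the device population
    """
    mu = lam_target * unit_hours          # expected number of failures
    # If seeing this many or more failures has probability < alpha under
    # the target rate, the target is probably being exceeded.
    p_at_least = poisson_sf(observed - 1, mu) if observed > 0 else 1.0
    return p_at_least < alpha

# 40 valves over 3 years at a target lambda_DU of 2e-6 per hour
hours = 40 * 3 * 8760.0                   # expected failures: ~2.1
print(failure_rate_alarm(observed=3, lam_target=2e-6, unit_hours=hours))  # False
print(failure_rate_alarm(observed=8, lam_target=2e-6, unit_hours=hours))  # True
```

A `True` result corresponds to the report's trigger for analysing causes and considering compensating measures; it does not by itself say which failures were common cause.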
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
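Even with zero recorded failures, the pooled operating hours still support an upper confidence bound on the rate. A sketch, assuming a homogeneous Poisson failure process over the pooled hours (device counts and hours are illustrative):

```python
import math

def poisson_cdf(k: int, mu: float) -> float:
    """P(N <= k) for N ~ Poisson(mu)."""
    return sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k + 1))

def lambda_upper(n_failures: int, total_hours: float, alpha: float = 0.05) -> float:
    """One-sided upper (1 - alpha) confidence bound on the failure rate.

    Solves P(N <= n_failures; lambda * T) = alpha for lambda by bisection.
    With n_failures = 0 this reduces to -ln(alpha) / T (the 'rule of three':
    roughly 3 / T at 95% confidence).
    """
    lo, hi = 0.0, (n_failures + 50.0) / total_hours
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if poisson_cdf(n_failures, mid * total_hours) > alpha:
            lo = mid       # rate still plausible; bound must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0

# 20 devices, 5 years of service, zero dangerous failures observed:
t = 20 * 5 * 8760.0
print(f"lambda_DU <= {lambda_upper(0, t):.2e} per hour at 95% confidence")
```

The bound shrinks only linearly with accumulated hours, which is why small populations cannot demonstrate low rates by counting failures alone and condition monitoring becomes the practical alternative.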
Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.
An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
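As a hypothetical illustration of such a leading indicator, the measured travel time of a valve partial-stroke test can be trended against an alarm limit; the choice of parameter and all numbers here are invented, not taken from the paper.

```python
# Hypothetical leading indicator: trend partial-stroke-test travel time
# and extrapolate to predict when it will cross an alarm limit.

def linear_trend(samples):
    """Least-squares slope and intercept for (time, value) samples."""
    n = len(samples)
    sx = sum(t for t, _ in samples)
    sy = sum(v for _, v in samples)
    sxx = sum(t * t for t, _ in samples)
    sxy = sum(t * v for t, v in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n

def hours_until_limit(samples, limit):
    """Extrapolate the trend to predict when the limit will be crossed."""
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None                  # no deterioration trend
    return (limit - intercept) / slope

# (operating hours, stroke time in seconds) at successive tests - invented data
history = [(0, 2.0), (2000, 2.1), (4000, 2.3), (6000, 2.4)]
print(hours_until_limit(history, limit=3.0))   # schedule overhaul before this
```

The point is the structure, not the numbers: deterioration is measured, compared with a limit derived from the failure mode, and the overhaul is scheduled before the incipient failure becomes a dangerous one.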
Designing for testability and maintainability
A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.
For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.
It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of the sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.
If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).
Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
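The proportionality between out-of-service time and PFD can be shown with a two-line calculation; the values are illustrative, and a single channel with annual proof testing is assumed.

```python
# Average unavailability contributed by bypass/out-of-service time, compared
# with the dangerous-undetected random-failure term. Illustrative values.

def pfd_from_downtime(hours_out_of_service: float, period_hours: float) -> float:
    """Fraction of the period during which the function cannot respond."""
    return hours_out_of_service / period_hours

lam_du = 2e-6                                   # assumed rate per hour
t_proof = 8760.0                                # annual proof test
pfd_random = lam_du * t_proof / 2.0             # single-channel lambda*T/2
pfd_bypass = pfd_from_downtime(24.0, 8760.0)    # one 24 h bypass per year
print(f"random-failure term: {pfd_random:.1e}")
print(f"bypass term:         {pfd_bypass:.1e}")
```

With these assumptions a single 24-hour bypass per year contributes unavailability of the same order as the random-failure term, which is why minimising MTTR through designed-in accessibility matters as much as minimising λDU.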
Preventing systematic failures
Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.
Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to the management of random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated, and if inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources is limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
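The order-of-magnitude character of the estimate follows directly from the linear dependence of PFD on λDU. A small illustration, with λDU values chosen to span a typical OREDA-style decade (not data from this paper):

```python
# For a single channel, PFD ~ lambda_DU * T / 2, so a decade of spread in
# lambda_DU maps directly onto a decade of spread in PFD and RRF.

t_proof = 8760.0                        # annual proof test (assumed)
for lam_du in (5e-7, 1.6e-6, 5e-6):     # illustrative order-of-magnitude spread
    pfd = lam_du * t_proof / 2.0
    print(f"lambda={lam_du:.1e}/h -> PFD={pfd:.1e}, RRF={1/pfd:.0f}")
```

The spread in the computed RRF is exactly the spread assumed in λDU, which is why the operator's measured rates, not the handbook value, ultimately determine the risk reduction achieved.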
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.
The level of attention to detail, and the effectiveness of the processes, techniques and measures, must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. The avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] 'OREDA Offshore and Onshore Reliability Data Handbook', Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] 'Safety Equipment Reliability Handbook', 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F. H. C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 9
Anticipating rather than measuring failures
It may be difficult to measure meaningful failure rates on some types of critical equipment If the hardware failure rates are relatively low and the population of devices is small there may be too few failures to allow a rate to be measured
Detailed inspection testing and condition monitoring can then be used to provide leading indicators of incipient failure Most of the modes of failure should be predictable It should be feasible to prevent failures through condition based maintenance with overhaul renewal or replacement when required
A FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration The likelihood of failure depends on the condition of the components
Designing for testability and maintainability
A common problem for plant operators is that the plants are designed to minimise initial construction cost The equipment is not designed to facilitate accessibility for testing or for maintenance
For example on LNG compression trains access to the equipment is often constrained Some critical final elements can only be taken out of service at intervals of 5 years or more Even if the designers have provided facilities to enable on-line condition monitoring and testing the opportunities for corrective maintenance are severely constrained by the need to maintain production Deteriorating equipment must remain in service until the next planned shutdown
It is common practice to install dutystandby pairs of pumps and motors where it is critical to maintain production The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose It facilitates on-line testing of sensors It is not common practice to provide dutystandby pairing for safety function final elements but it is possible Dutystandby service can be achieved at by using 2 x 1oo1 or 2oo4 architectures The justification for the additional cost depends on the value of process downtime that can be avoided
If dutystandby pairing is not provided for critical final elements then accessibility for on-line inspection testing and maintenance must be considered in the design A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration MTTR)
Designing the system to facilitate inspection testing and maintenance enables both the λDU and the MTTR to be minimised in operation
Preventing systematic failures
Systematic failures cannot be characterised by failure rate The probability of systematic faults existing within a system cannot be quantified with precision But by definition systematic failures can be eliminated after being detected The implication of this is that systematic failures can be prevented
Due diligence must be demonstrated in preventing systematic failures as far as is practicable in proportion to the target level of risk reduction Plant owners need to satisfy themselves that
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 10
appropriate processes techniques methods and measures have been applied with sufficient effectiveness to eliminate systematic failures The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures and must measure and monitor the effectiveness of those steps
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures The standards describe processes techniques methods and measures to prevent avoid and detect systematic faults and resulting failures
Activities techniques measures and procedures can be selected to detect or to prevent faults and failures IEC 61511-1 sect623 requires planning of activities criteria techniques measures and procedures throughout the safety system lifecycle The rationale needs to be recorded
Preventing lsquorandomrsquo failures
This same approach of active prevention should be extended to include the management of the random failures that are not purely random Most failures that are usually classed as random are actually preventable to some extent This includes all common cause failures
Conclusions Wide variation in failure data
Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates The precision in the estimates is limited by the uncertainty in the failure rates OREDA statistics clearly show that the failure rates vary over at least an order of magnitude The rates are not fixed and constant The reported failure rates serve as an indication of the failure rates that can be feasibly achieved with established practices for operation inspection maintenance and renewal
Confusion between random and systematic
There is no clear and consistent definition to distinguish random failures from systematic failures Most failures fall somewhere in the middle between the two extremes of purely random and purely deterministic Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure In practice failures are not completely preventable because access to the equipment and resources are limited Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures
PFD Calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant These assumptions enable the PFD and RRF to be estimated within an order of magnitude given reasonably effective quality control in design manufacture operation and maintenance The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 11
Strategies for improving risk reduction
1 The first priority in functional safety is to eliminate prevent avoid or control systematic failures throughout the entire system lifecycle Failures are minimised by applying conventional quality management and project management practices This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function It includes the achievement of an appropriate level of systematic capability
The level of attention to detail and the effectiveness of the processes techniques and measures must be in proportion to the target level of risk reduction SIL 3 functions need much stricter quality control than SIL 1 functions
2 The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics accessibility inspection testing maintenance and renewal
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction The requirements also depend on the cost of downtime To achieve SIL 3 safety functions will always need to be designed to enable ready access for inspection testing and maintenance
The planning for inspection and testing should be in proportion to the target level of risk reduction Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment
3 Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions Common cause failures dominate the PFD in all voted architectures of sensors and of final elements
4 In the operations phase measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design inspection testing maintenance and renewal The measured failure rates should be compared with the rates assumed in the PFD calculations
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future
If the measured failure rates are higher than the target benchmark then the reasons need to be understood and remedial action taken Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 12
References
[1] lsquoFunctional safety ndash Safety instrumented systems for the process industry sector ndash Part 1 Framework definitions system hardware and application programming requirementsrsquo IEC 61511-12016
[2] lsquoEquipment reliability testing Part 6 Tests for the validity and estimation of the constant failure rate and constant failure intensityrsquo IEC 60605-62007
[3] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems ndash Part 4 Definitions and abbreviationsrsquo IEC 61508-42010
[4] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems - Guidelines on the application of IEC 61508-2 and IEC 61508-3rsquo IEC 61508-62010
[5] lsquoPetroleum petrochemical and natural gas industries mdash Reliability modelling and calculation of safety systemsrsquo ISOTR 12489
[6] lsquoSafety Integrity Level (SIL) Verification of Safety Instrumented Functionsrsquo ISA-TR840002-2015
[7] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Method Handbookrsquo SINTEF Trondheim Norway Report A24442 2013
[8] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Data Handbookrsquo SINTEF Trondheim Norway Report A24443 2013
[9] S Hauge and M A Lundteigen lsquoGuidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phasersquo SINTEF Trondheim Norway Report A8788 2008
[10] S Hauge et al lsquoCommon Cause Failures in Safety Instrumented Systemsrsquo SINTEF Trondheim Norway Report A26922 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook Vol 1 6th ed SINTEF Technology and Society Department of Safety Research Trondheim Norway 2015
[12] Safety Equipment Reliability Handbook 4th ed exidacom LLC Sellersville PA 2015
[13] FHC Marriott lsquoA Dictionary of Statistical Termsrsquo 5th Edition International Statistical Institute Longman Scientific and Technical 1990
Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 10
appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.
The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.
Activities, techniques, measures and procedures can be selected either to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale for the selection needs to be recorded.
Preventing 'random' failures
This same approach of active prevention should be extended to include the management of the 'random' failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.
Conclusions
Wide variation in failure data
Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates: OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve instead as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.
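The effect of this spread can be illustrated with a short sketch, using the standard low-demand approximation PFDavg ≈ λ_DU · T / 2. The failure rates below are hypothetical values chosen only to span an order of magnitude, not data from OREDA:

```python
# Illustrative sketch: under the constant-failure-rate assumption,
# PFDavg for a single device is approximately lambda_DU * T / 2,
# where T is the proof-test interval. An order-of-magnitude spread
# in lambda_DU therefore gives an order-of-magnitude spread in PFDavg.

def pfd_avg(lambda_du_per_hr, test_interval_hr):
    """Average probability of failure on demand, low-demand approximation."""
    return lambda_du_per_hr * test_interval_hr / 2.0

T = 8760.0  # proof-test interval: one year, in hours

# Hypothetical dangerous-undetected failure rates (per hour),
# spanning the kind of order-of-magnitude range seen in field data
for lam in (0.2e-6, 0.6e-6, 2.0e-6):
    print(f"lambda_DU = {lam:.1e}/h -> PFDavg = {pfd_avg(lam, T):.1e}")
```

The point of the sketch is only that no amount of arithmetic precision in the formula can overcome the spread in the input rate.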
Confusion between random and systematic
There is no clear and consistent definition that distinguishes random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable, provided the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources is limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.
PFD calculations set feasible performance benchmarks
PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
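The shape of such a calculation can be sketched for a 1oo2 voted pair using the simple beta-factor model. The failure rate and beta value below are illustrative assumptions, not recommended figures:

```python
# Sketch of a PFDavg estimate for a 1oo2 voted architecture using the
# simple beta-factor model. lambda_du and beta are illustrative
# assumptions; real calculations would justify both from field data.

def pfd_1oo2(lambda_du, beta, test_interval_hr):
    """Return (PFDavg, independent term, common cause term) for 1oo2."""
    independent = ((1.0 - beta) * lambda_du * test_interval_hr) ** 2 / 3.0
    common_cause = beta * lambda_du * test_interval_hr / 2.0
    return independent + common_cause, independent, common_cause

lam = 1.0e-6      # dangerous undetected failures per hour (assumed)
beta = 0.05       # assumed fraction of common cause failures
T = 8760.0        # annual proof test, in hours

pfd, ind, ccf = pfd_1oo2(lam, beta, T)
print(f"PFDavg = {pfd:.1e}  (RRF = {1.0 / pfd:.0f})")
print(f"independent term: {ind:.1e}, common cause term: {ccf:.1e}")
# Even at beta = 5 %, the common cause term dominates PFDavg, which is
# why the assumed beta matters as much as the assumed failure rate.
```

The dominance of the common cause term in this sketch is the same effect noted in the strategies below: redundancy alone cannot buy risk reduction that common cause failures take away.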
Strategies for improving risk reduction
1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function, and it includes the achievement of an appropriate level of systematic capability.
The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction: SIL 3 functions need much stricter quality control than SIL 1 functions.
2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.
The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction, and also on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.
The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should likewise be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.
3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.
4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.
Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.
If the measured failure rates are higher than the target benchmark, the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.
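The operations-phase comparison in strategy 4 can be sketched as a simple check of the observed rate against the rate assumed in the PFD calculation. The fleet size, service hours and failure count below are hypothetical:

```python
# Sketch of the operations-phase benchmark check: compare the failure
# rate observed in service against the rate assumed in the PFD
# calculation. All counts and rates here are hypothetical.

def observed_rate(failures, unit_hours):
    """Naive point estimate of the dangerous failure rate in service."""
    return failures / unit_hours

assumed_lambda = 0.5e-6          # per hour, as assumed in the PFD calculation
fleet_hours = 40 * 3 * 8760.0    # e.g. 40 devices over 3 years of service
failures_found = 2               # failures revealed by proof testing

lam_hat = observed_rate(failures_found, fleet_hours)
if lam_hat > assumed_lambda:
    print(f"Measured {lam_hat:.1e}/h exceeds assumed {assumed_lambda:.1e}/h:"
          " investigate root causes and take remedial action")
else:
    print("Measured rate is within the benchmark")
```

A real follow-up programme would also put confidence bounds on the estimate, since a small fleet accumulates few failures; the point estimate alone only flags where investigation should start.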
References
[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016
[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007
[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010
[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010
[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489
[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015
[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013
[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013
[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008
[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015
[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015
[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015
[13] F. H. C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990