Availability and reliability, 2013 Slide 1 Availability and Reliability


Principal dependability properties

Reliability
The probability of failure-free system operation over a specified time in a given environment for a given purpose.

Availability
The probability that a system, at a point in time, will be operational and able to deliver the requested services.

Availability specification
Both reliability and availability attributes can be expressed as numbers.
Availability of 0.999 means that the system is up and running for 99.9% of the time.
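
To make the 0.999 figure concrete, the sketch below converts an availability value into the downtime it allows over a period. The function name `expected_downtime_hours` and the choice of a 365-day year are assumptions made for this illustration, not part of the original slides.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def expected_downtime_hours(availability: float,
                            period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime that a given availability permits."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours


if __name__ == "__main__":
    for value in (0.99, 0.999, 0.9999):
        hours = expected_downtime_hours(value)
        print(f"availability {value}: about {hours:.2f} hours down per year")
```

Under these assumptions, an availability of 0.999 corresponds to a little under nine hours of downtime per year.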

Reliability specification
Probability of failure on demand (POFOD) of 0.0001 means that, on average, 1 in 10,000 demands for service from a system will fail in some way.
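
As a rough illustration of what a POFOD of 0.0001 implies, the sketch below computes the expected number of failed demands and the chance of seeing at least one failure. It assumes that demands fail independently of each other, which the slides do not state; both function names are hypothetical.

```python
def expected_failures(pofod: float, demands: int) -> float:
    """Average number of failed demands out of `demands` requests."""
    return pofod * demands


def prob_at_least_one_failure(pofod: float, demands: int) -> float:
    """Chance of one or more failures, assuming independent demands."""
    return 1.0 - (1.0 - pofod) ** demands


if __name__ == "__main__":
    pofod = 0.0001
    for n in (100, 10_000, 100_000):
        print(n,
              round(expected_failures(pofod, n), 4),
              round(prob_at_least_one_failure(pofod, n), 4))
```

With these assumptions, 10,000 demands give one expected failure on average, yet the chance of at least one failure is only roughly 63%, because some runs of 10,000 demands see no failure while others see several.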

Availability and reliability
Availability and reliability are closely related.
Obviously, if a system is unavailable it is not delivering the specified system services.

However, it is possible to have systems with low reliability that must nevertheless be available. So long as system failures can be repaired quickly and do not damage data, some system failures may not be a problem.

Availability is therefore best considered as a separate attribute reflecting whether or not the system can deliver its services.
Availability takes repair time into account if the system has to be taken out of service to repair faults.
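
The role of repair time can be sketched with the common steady-state formula availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair. The slides do not give this formula; it and the example figures below are assumptions used only for illustration.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and to repair."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be > 0 and mttr_hours must be >= 0")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical system A: fails often, but is repaired in under a minute.
    print(f"A: {availability(mttf_hours=10, mttr_hours=0.01):.4f}")
    # Hypothetical system B: fails rarely, but each repair takes ten hours.
    print(f"B: {availability(mttf_hours=1000, mttr_hours=10):.4f}")
```

In this sketch the less reliable system A ends up more available than system B, because its failures are repaired so quickly.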

Availability perception
Availability is usually expressed as a percentage of the time that the system is available to deliver services, e.g. 99.9%.

Subjective availability
The number of users affected by the service outage matters. Loss of service in the middle of the night is less important for many systems than loss of service during peak usage periods.

The length of the outage also matters. The longer the outage, the greater the disruption. Several short outages are less likely to be disruptive than one long outage. Long repair times are a particular problem.
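
One way to see why a time-based percentage can mislead is to weight each outage by the number of users it affects. The `Outage` record and the "user-minutes lost" measure below are hypothetical, introduced only for this sketch; they are not a metric defined in the slides.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Outage:
    duration_minutes: float
    affected_users: int


def user_minutes_lost(outages: Iterable[Outage]) -> float:
    """Sum of outage duration multiplied by users affected (hypothetical metric)."""
    return sum(o.duration_minutes * o.affected_users for o in outages)


if __name__ == "__main__":
    night = [Outage(duration_minutes=30, affected_users=50)]
    peak = [Outage(duration_minutes=30, affected_users=5000)]
    # Both outages reduce time-based availability by the same amount.
    print("night:", user_minutes_lost(night))
    print("peak: ", user_minutes_lost(peak))
```

This simple weighting treats one 30-minute outage the same as three 10-minute outages, so it does not capture the point that a single long outage tends to be more disruptive than several short ones.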

Reliability metrics
Probability of failure on demand (POFOD)
Probability that a system will not deliver a service correctly when requested.
Used for systems where demands are infrequent and intermittent.

Rate of occurrence of failure (ROCOF)
Number of system failures in a given time period.
Used for transaction processing systems with frequent and regular transactions.
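
Both metrics can be estimated from simple counts taken during operation or testing. The sketch below shows one plausible way to do so; the function names and the sample figures are assumptions for illustration only.

```python
def estimate_pofod(failed_demands: int, total_demands: int) -> float:
    """Fraction of demands that were not served correctly."""
    if total_demands <= 0:
        raise ValueError("total_demands must be positive")
    return failed_demands / total_demands


def estimate_rocof(failures: int, operating_hours: float) -> float:
    """Observed failures per hour of operation."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return failures / operating_hours


if __name__ == "__main__":
    # Hypothetical observations: 2 failed demands out of 20,000,
    # and 4 failures during 1,000 hours of operation.
    print("POFOD:", estimate_pofod(2, 20_000))
    print("ROCOF:", estimate_rocof(4, 1_000), "failures per hour")
```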

Fault
A characteristic of a software system that can lead to a system error.

Error
An erroneous system state that can lead to system behavior that is unexpected by system users.

Failure
An event that occurs at some point in time when the system does not deliver a service as expected by its users.

Faults-errors-failures
Fault -> Error -> Failure

Faults and failures
Failures are usually a result of system errors.
The incorrect state causes undesirable system behaviour.
Incorrect state is a consequence of executing faulty code.

However, faults do not necessarily result in system errors.
The erroneous system state resulting from the fault may be transient and corrected before an error arises.
The faulty code may never be executed.
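
The chain from fault to error to failure can be sketched with a small, deliberately faulty class. The `Order` class, its pricing rule, and the planted fault are all hypothetical and exist only to illustrate the three terms.

```python
class Order:
    def __init__(self, weight_kg: float) -> None:
        self.cost = 5.0 + 2.0 * weight_kg

    def make_express(self) -> None:
        # FAULT (deliberate): the surcharge is subtracted instead of added.
        self.cost -= 10.0

    def invoice(self) -> str:
        return f"Total due: {self.cost:.2f}"


if __name__ == "__main__":
    # The faulty code is never executed, so no error arises.
    standard = Order(1.0)
    print(standard.invoice())

    # The fault is executed: the stored cost is now an erroneous state.
    express = Order(1.0)
    express.make_express()

    # The error becomes a failure once the user sees a wrong invoice.
    print(express.invoice())
```

The fault is present in both runs, but only the express order executes it, enters an erroneous state, and then delivers an incorrect result to the user.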

Errors do not necessarily lead to system failures
The error can be corrected by built-in error detection and recovery.
The failure can be protected against by built-in protection facilities. These may, for example, protect system resources from system errors.
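
A minimal sketch of built-in error detection and recovery is shown below, assuming a hypothetical `Account` class whose invariant is that the balance never goes negative. The update step contains a deliberate fault; the check after it detects the erroneous state and rolls it back before any user can observe it.

```python
class Account:
    def __init__(self, balance: float) -> None:
        self.balance = balance

    def _apply_withdrawal(self, amount: float) -> None:
        # FAULT (deliberate): no check that amount <= balance.
        self.balance -= amount

    def withdraw(self, amount: float) -> bool:
        previous = self.balance          # checkpoint taken before the update
        self._apply_withdrawal(amount)
        if self.balance < 0:             # error detection: invariant violated
            self.balance = previous      # recovery: restore the checkpoint
            return False
        return True


if __name__ == "__main__":
    account = Account(100.0)
    print(account.withdraw(150.0), account.balance)
    print(account.withdraw(30.0), account.balance)
```

Here the fault still exists and still produces an erroneous state, but the state is corrected immediately, so the error does not turn into a failure.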

Reliability achievement
Fault avoidance
Development techniques are used that either minimise the possibility of mistakes or trap mistakes before they result in the introduction of system faults.

Fault detection and removal
Verification and validation techniques that increase the probability of detecting and correcting errors before the system goes into service are used.

Fault tolerance
Run-time techniques are used to ensure that system faults do not result in system errors and/or that system errors do not lead to system failures.
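
One simple run-time technique is to fall back to an alternative component when the first one fails. The sketch below is only one possible form of fault tolerance, not the approach prescribed by the slides; `call_with_fallback` and the two example providers are hypothetical names.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallback(providers: Sequence[Callable[[], T]]) -> T:
    """Try each provider in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # a failed provider is masked by the next one
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


def primary() -> str:
    # Hypothetical primary service that is currently failing.
    raise ConnectionError("primary service unreachable")


def backup() -> str:
    # Hypothetical backup that can still answer.
    return "value from backup"


if __name__ == "__main__":
    print(call_with_fallback([primary, backup]))
```

The caller still receives a result even though the primary component failed, so the component failure does not become a system failure. If every provider fails, the function raises an exception rather than returning a wrong answer.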

Summary
Availability is the probability that a system will be available when a service request is made.
Reliability is the probability that a system will deliver a service as expected by users.
Software faults lead to state errors, which lead to operational failures.
Fault avoidance, detection and tolerance are strategies for achieving reliability.
