Sa 007 availability

Department of Information Technology (Vakgroep Informatietechnologie) – IBCN

Software Architecture

Quality Attributes & Tactics (4)

Availability


Availability

Faults & Failures: Faults become failures if they are not corrected or masked. A failure is observable by the system user; a fault is not.

Areas of concern:
- Fault detection and frequency
- Reduced operations
- Recovery and prevention

Availability is about system failure and its consequences.

Availability = MTBF / (MTBF + MTTR)
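As a worked example of the availability ratio, the sketch below assumes MTBF is the mean time between failures and MTTR the mean time to repair, both in hours. The function name and the figures (999 hours between failures, 1 hour to repair) are illustrative assumptions, not values from the course.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

# Hypothetical figures: one failure every 999 hours, one hour to repair.
a = availability(mtbf_hours=999.0, mttr_hours=1.0)
downtime = (1.0 - a) * HOURS_PER_YEAR
print(f"availability = {a:.3f}, expected downtime = {downtime:.2f} h/year")
# prints: availability = 0.999, expected downtime = 8.76 h/year
```

The ratio shows that availability can be raised either by failing less often (larger MTBF) or by repairing faster (smaller MTTR).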


Availability Generic Scenario



Source of stimulus (who or what?): We differentiate between internal and external indications of faults or failure, since the desired system response may be different.

Stimulus (does something?): A fault of one of the following classes occurs.
- Omission: a component fails to respond to an input.
- Crash: the component repeatedly suffers omission faults.
- Timing: a component responds, but the response is early or late.
- Bad response: a component responds with an incorrect value.
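The four fault classes can be told apart by what an observer sees for a single request. The following is a minimal sketch of such a classifier; the function names, the integer payload, the timing window, and the crash threshold of three consecutive omissions are all assumptions made for illustration.

```python
from enum import Enum
from typing import List, Optional


class Fault(Enum):
    OMISSION = "omission"
    TIMING = "timing"
    BAD_RESPONSE = "bad response"


def classify(response: Optional[int], expected: int, elapsed_s: float,
             earliest_s: float, latest_s: float) -> Optional[Fault]:
    """Classify one observed interaction; None means no fault was observed."""
    if response is None:
        return Fault.OMISSION          # no answer at all
    if not (earliest_s <= elapsed_s <= latest_s):
        return Fault.TIMING            # answered, but too early or too late
    if response != expected:
        return Fault.BAD_RESPONSE      # answered in time, but with a wrong value
    return None


def is_crash(history: List[Optional[Fault]], threshold: int = 3) -> bool:
    """A crash is treated here as `threshold` consecutive omissions (assumed value)."""
    recent = history[-threshold:]
    return len(recent) == threshold and all(f is Fault.OMISSION for f in recent)
```

In practice the expected value is rarely known in advance, which is one reason bad responses are usually detected by comparing redundant components rather than by a single observer.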

Artifact (to the system or part of it?): This specifies the resource that is required to be highly available: processor, communication channel, process, storage.



Environment (under certain conditions): The state of the system affects the desired system response.
- Normal mode: if this is the first fault observed, some degradation of response time or function may be preferred.
- Degraded mode: if the system has already seen some faults, it may be desirable to shut it down totally.
- Overload mode:



Response (how the system reacts?): The system should detect the event and do one or more of the following:
- Record it.
- Notify appropriate parties, including the user and other systems.
- Disable sources of events that cause fault or failure, according to defined rules.
- Be unavailable for a specified interval, where the interval depends on the criticality of the system.
- Continue to operate in normal or degraded mode.



Response measure (how can you measure this?):
- Time interval when the system must be available
- Availability time
- Time interval in which the system can be in degraded mode
- Repair time


Availability Specific Scenario

“An unanticipated external message (DoS attack) is received by a process during normal operation. The process logs the receipt of the message, notifies the operator, and continues with no downtime.”

Case: Digital Signage – Public Transport


Availability QAS (quality attribute scenario):

Q: What is the architectural impact of this requirement?

SOURCE (who or what): a random event
STIMULUS (does something): causes a failure
ARTIFACT (to the system or part of it): to the communication system
ENVIRONMENT (under certain conditions): during normal operations
RESPONSE (how the system reacts): all displays must start showing scheduled arrival times for all buses
MEASURE (how you can measure this): within 30 seconds of failure detection
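One possible reading of the architectural impact is that each display must hold a local copy of the timetable, so that it can fall back to scheduled times without the communication system. The sketch below illustrates that reading only; the `Display` class, its fields, the message texts, and the 10-second staleness limit are hypothetical and not part of the case description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Display:
    scheduled: str                      # locally cached timetable text
    staleness_limit_s: float = 10.0     # hypothetical failure-detection threshold
    live: Optional[str] = None          # last real-time text received
    last_update_s: Optional[float] = None

    def on_live_update(self, text: str, now_s: float) -> None:
        """Called whenever the communication system delivers real-time data."""
        self.live = text
        self.last_update_s = now_s

    def render(self, now_s: float) -> str:
        """Show live data while it is fresh, otherwise the scheduled time."""
        fresh = (self.last_update_s is not None
                 and now_s - self.last_update_s <= self.staleness_limit_s)
        return self.live if fresh and self.live is not None else self.scheduled


display = Display(scheduled="Bus 5: 14:30 (scheduled)")
display.on_live_update("Bus 5: 3 min", now_s=0.0)
shown_before = display.render(now_s=5.0)    # live data is still fresh
shown_after = display.render(now_s=25.0)    # no update for 25 s: fall back
```

Because the fallback decision is local to the display, the switch happens at the next render after detection, which would sit comfortably inside the 30-second measure under these assumptions.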


Availability Tactics

[Figure: Fault -> Tactics to control availability -> Fault masked or repaired]

Fault Detection: Echo, Heartbeat, Exceptions

Fault Recovery: Preparing for recovery, Accomplishing the recovery

Fault Prevention
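Of the detection tactics listed, heartbeat is the easiest to sketch: each component periodically reports that it is alive, and a monitor suspects any component whose last report is too old. The class below is a minimal illustration under that assumption; the class name, the component names, and the 2-second timeout are made up, and the clock is injectable so the example runs deterministically.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def beat(self, component: str) -> None:
        """Record a heartbeat from a component at the current time."""
        self.last_seen[component] = self.clock()

    def suspected_failures(self) -> List[str]:
        """Components whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)


fake_now = [0.0]                       # fake clock for a repeatable demo
monitor = HeartbeatMonitor(timeout_s=2.0, clock=lambda: fake_now[0])
monitor.beat("display-1")
monitor.beat("display-2")
fake_now[0] = 1.5
monitor.beat("display-1")              # display-2 stays silent
fake_now[0] = 3.0
print(monitor.suspected_failures())    # prints ['display-2']
```

Ping/echo inverts the direction: the monitor sends a request and waits for the reply, instead of waiting passively for the component to report.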


Fault Recovery Tactics

Voting tactic:
- Processes running on redundant processors each take the input, compute, and report the results to the “vote-counter.”
- Voting algorithms: majority rules, preferred component.
- This corrects faulty operation of components, algorithms or processors.
- The more severe the consequences of failure, the more stringent the effort to ensure that the redundancy is independent (separate processors, separate implementation teams, … dissimilar platforms).
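A minimal sketch of a "majority rules" vote-counter is shown below. It assumes each redundant process reports a single hashable value; the function names are illustrative, and a real voter would also need a policy for timeouts and for the case where no majority exists.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority, or None if there is none."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


def deviants(results: Sequence[Hashable], winner: Hashable) -> List[int]:
    """Indices of the processes whose result disagrees with the winning value."""
    return [i for i, r in enumerate(results) if r != winner]


reports = [42, 42, 41]                 # third processor computed a wrong value
winner = majority_vote(reports)
print(winner, deviants(reports, winner))   # prints 42 [2]
```

The vote only masks faults that are independent: if all replicas share the same bug, they agree on the same wrong answer, which is why the slide stresses separate processors, teams, and platforms.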



Active redundancy (hot restart):
- All redundant components respond to events in parallel.
- Redundant components are synchronized at start; the first to return is the answer.
- This covers some faults: a faulty processor will be slower to respond.
- When a failure occurs, the downtime is usually only milliseconds (switching to another component).
- Often used in client-server applications involving back-end databases.
- In high availability for LANs, the redundancy may be separate paths, so that failure of a bridge or router is not fatal.
- Note the synchronization demands here.
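The "first to return is the answer" rule can be sketched with threads standing in for redundant components. This is an illustration only: the replica functions, their delays, and the doubling computation are invented, and the sketch ignores the state synchronization that real active redundancy requires.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Sequence, Tuple

Reply = Tuple[str, int]


def fast(request: int) -> Reply:
    return ("fast", request * 2)


def slow(request: int) -> Reply:
    time.sleep(0.2)                    # a degraded replica answers late
    return ("slow", request * 2)


def crashed(request: int) -> Reply:
    raise RuntimeError("replica down")


def first_response(request: int,
                   replicas: Sequence[Callable[[int], Reply]]) -> Reply:
    """Send the request to every replica and keep the first successful reply."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    try:
        futures = [pool.submit(replica, request) for replica in replicas]
        for future in as_completed(futures):
            if future.exception() is None:
                return future.result()
        raise RuntimeError("all replicas failed")
    finally:
        pool.shutdown(wait=False)      # do not wait for slower replicas


reply = first_response(21, [slow, crashed, fast])
```

Here the crashed replica is skipped and the slow one is simply outrun, so the caller sees no downtime as long as one healthy replica remains.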



Passive redundancy:
- One component responds to events and informs the standbys of state updates.
- Upon failure the system must:
  - Ensure that the backup is sufficiently fresh (restart points, checkpoints, log points?).
  - Remap the system to switch which component is the active one.
- Often used in control systems.
- Example: air traffic control (Chapter 6, "Air Traffic Control: A Case Study in Designing for High Availability").
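The primary/standby arrangement described above can be sketched as follows. The classes, the sequence-number scheme used to judge freshness, and the example keys are assumptions for illustration; updates are delivered synchronously here, whereas a real system would have to cope with lost or delayed updates.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Replica:
    name: str
    state: Dict[str, int] = field(default_factory=dict)
    applied: int = 0                   # sequence number of the last applied update

    def apply(self, seq: int, key: str, value: int) -> None:
        self.state[key] = value
        self.applied = seq


class PassiveGroup:
    def __init__(self, primary: Replica, standbys: List[Replica]) -> None:
        self.primary = primary
        self.standbys = standbys
        self.seq = 0

    def handle(self, key: str, value: int) -> None:
        """Only the primary responds; standbys just receive the state update."""
        self.seq += 1
        self.primary.apply(self.seq, key, value)
        for standby in self.standbys:
            standby.apply(self.seq, key, value)

    def fail_over(self) -> Replica:
        """Promote the freshest standby and remap it as the active component."""
        if not self.standbys:
            raise RuntimeError("no standby available")
        freshest = max(self.standbys, key=lambda r: r.applied)
        self.standbys.remove(freshest)
        self.primary = freshest
        self.seq = freshest.applied
        return freshest


group = PassiveGroup(Replica("A"), [Replica("B"), Replica("C")])
group.handle("sector-7", 3)
group.handle("sector-7", 4)
new_primary = group.fail_over()        # primary "A" is assumed to have failed
print(new_primary.name, new_primary.state)   # prints B {'sector-7': 4}
```

Compared with active redundancy, the standbys do no computation of their own, so switchover takes longer but the synchronization cost during normal operation is lower.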



Switchovers: upon failure or periodic.

Synchronization: the responsibility of the primary component, which broadcasts synchronization signals to the redundant components.


Fault Prevention Tactics

Removal from service: take a component out of operation to perform some preventive actions, e.g., rebooting to prevent slow memory leaks from causing problems.

Transactions: the bundling of a sequence of steps so that they can be done (or undone) all at once.

Process monitor: once a fault in a process is detected, remove the process, reinstantiate it, and reinitialize its state.
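The transaction tactic can be illustrated with an in-memory, all-or-nothing update. This sketch assumes the state fits in a dictionary and takes a full snapshot before the steps run; the `transaction` helper and the account example are hypothetical, and a production system would rely on a database or a log instead.

```python
import copy
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def transaction(state: Dict[str, int]) -> Iterator[Dict[str, int]]:
    """Apply a bundle of steps to `state`; undo all of them if any step fails."""
    snapshot = copy.deepcopy(state)
    try:
        yield state
    except Exception:
        state.clear()
        state.update(snapshot)         # roll back to the state before the bundle
        raise


accounts = {"a": 100, "b": 0}
try:
    with transaction(accounts) as tx:
        tx["a"] -= 150                 # step 1 succeeds
        if tx["a"] < 0:
            raise ValueError("insufficient funds")   # step 2 fails
        tx["b"] += 150
except ValueError:
    pass
print(accounts)                        # prints {'a': 100, 'b': 0}
```

The fault is prevented from becoming a failure because no observer ever sees the half-finished state in which money has left one account without arriving in the other.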


Availability Tactics Hierarchy

Availability
- Fault detection: Ping/echo, Heartbeat, Exception
- Recovery, preparation and repair: Voting, Active redundancy, Passive redundancy, Spare
- Recovery, reintroduction: Shadow, State resynchronization, Rollback
- Prevention: Removal from service, Transactions, Process monitor

Fault arrives -> Fault masked or repaired
