frank-gielen
Vakgroep Informatietechnologie – IBCN
Software Architecture
Quality Attributes & Tactics (4)
Availability
Vakgroep Informatietechnologie – Onderzoeksgroep IBCN p. 2
Availability
Faults & Failures: Faults become failures if not corrected or masked. A failure is observable by the system user; a fault is not.
Areas of concern:
- Fault detection and frequency
- Reduced operations
- Recovery and prevention
Availability is about system failure and its consequences.
Availability = MTBF / (MTBF + MTTR)
where MTBF is the mean time between failures and MTTR is the mean time to repair.
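As a worked instance of the availability ratio MTBF / (MTBF + MTTR), the following minimal sketch computes availability and the downtime per year it implies. The function names and the sample figures (999 hours between failures, 1 hour to repair) are illustrative assumptions, not values from the course.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: a non-leap year of continuous operation


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return MTBF / (MTBF + MTTR) as a fraction between 0 and 1."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(avail: float) -> float:
    """Translate an availability fraction into expected hours of downtime per year."""
    return (1.0 - avail) * HOURS_PER_YEAR


a = availability(mtbf_hours=999.0, mttr_hours=1.0)
print(f"{a:.3f}")                                    # 0.999
print(f"{expected_downtime_hours_per_year(a):.2f}")  # 8.76
```

The same ratio shows the two levers an architect has: lengthen the time between failures, or shorten the time to repair.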
Availability Generic Scenario
Availability generic scenario (1/4)
Source of stimulus (who or what?): We differentiate between internal and external indications of faults or failure, since the desired system response may be different.
Stimulus (does something?): A fault of one of the following classes occurs.
- Omission: a component fails to respond to an input.
- Crash: the component repeatedly suffers omission faults.
- Timing: a component responds, but the response is early or late.
- Bad response: a component responds with an incorrect value.
Artifact (to the system or part of it?): This specifies the resource that is required to be highly available: processor, communication channel, process, storage.
Availability generic scenario (2/4)
Environment (under certain conditions): The state of the system affects the desired system response.
- Normal mode: if this is the first fault observed, some degradation of response time or function may be preferred.
- Degraded mode: if the system has already seen some faults, it may be desirable to shut it down totally.
- Overload mode
Availability generic scenario (3/4)
Response (how the system reacts?): The system should detect the event and do one or more of the following:
- Record it.
- Notify appropriate parties, including the user and other systems.
- Disable sources of events that cause fault or failure, according to defined rules.
- Be unavailable for a specified interval, where the interval depends on the criticality of the system.
- Continue to operate in normal or degraded mode.
Availability generic scenario (4/4)
Response measure (how can you measure this?):
- Time interval when the system must be available
- Availability time
- Time interval in which the system can be in degraded mode
- Repair time
Availability Specific Scenario
“An unanticipated external message (DoS attack) is received by a process during normal operation. The process logs the receipt of the message, notifies the operator, and continues with no downtime.”
Case: Digital Signage – Public Transport
Availability QAS:
Q: What is the architectural impact of this requirement?
SOURCE (who or what): A random event
STIMULUS (does something): ... causes a failure
ARTIFACT (to the system or part of it): ... to the communication system
ENVIRONMENT (under certain conditions): ... during normal operations
RESPONSE (how the system reacts): All displays must start showing scheduled arrival times for all buses
MEASURE (how you can measure this): ... within 30 seconds of failure detection
Availability Tactics
Fault -> Tactics to Control Availability -> Fault Masked or Repaired
Fault Detection: Echo, Heartbeat, Exceptions
Fault Recovery: Preparing for recovery, Accomplishing the recovery
Fault Prevention
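To make the heartbeat detection tactic concrete, here is a minimal sketch in which each component periodically reports in and a monitor flags any component whose last heartbeat is older than a timeout. The class name `HeartbeatMonitor`, the component names, and the 3-second timeout are hypothetical; the clock is injectable so the example runs deterministically.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per component and reports those that went silent."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock          # injectable time source (assumption, for testing)
        self._last_seen = {}         # component name -> time of last heartbeat

    def beat(self, component: str) -> None:
        """Called by (or on behalf of) a component to signal that it is alive."""
        self._last_seen[component] = self._clock()

    def suspected_failed(self) -> list:
        """Return the names of components whose heartbeat is overdue."""
        now = self._clock()
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )


# Demonstration with a fake clock instead of real waiting.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_now[0])
monitor.beat("display-1")
monitor.beat("display-2")
fake_now[0] = 2.0
monitor.beat("display-1")          # display-2 stays silent from here on
fake_now[0] = 4.0
print(monitor.suspected_failed())  # ['display-2']
```

Ping/echo inverts the direction: the monitor sends the request and waits for the reply, whereas with heartbeat the component emits the signal on its own.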
Fault Recovery Tactics (1/4)
Voting Tactic:
- Processes running on redundant processors each take the input, compute, and report the results to the “vote-counter.”
- Voting algorithms: majority rules, preferred component.
Preferred component:
- This corrects faulty operation of components, algorithms or processors.
- The more severe the consequences of failures, the more stringent the effort to ensure that the redundancy is independent: separate processors, separate implementation teams, … dissimilar platforms.
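The “majority rules” variant of the vote-counter can be sketched as follows. This is an illustration under assumptions: the function name `majority_vote` is hypothetical, results are assumed to be hashable values, and a strict majority (more than half) is required before an answer is accepted.

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of redundant components.

    Raises RuntimeError when no value has more than half of the votes,
    so the caller can treat the disagreement as a detected fault.
    """
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among redundant results")
    return value


# Three redundant computations; one of them returns a deviant value.
print(majority_vote([42, 42, 41]))  # 42
```

With three voters the scheme masks one faulty result; masking two simultaneous faults would need at least five.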
Fault Recovery Tactics (2/4)
Active redundancy (hot restart):
- All redundant components respond to events in parallel.
- Redundant components are synchronized at start; then the first to return is the answer.
- This covers some faults: a faulty processor will be slower to respond.
- When a failure occurs, the downtime is usually only milliseconds (switching to another component).
- Often used in client-server applications involving back-end databases.
- In high availability for LANs, the redundancy may be separate paths, so that failure of a bridge or router is not fatal.
- Note the synchronization demands here.
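The “first to return is the answer” behaviour of active redundancy can be sketched with a thread pool: the same event is sent to every replica in parallel, and the first successful response is used. The function `first_response` and the three toy replicas are hypothetical, and threads stand in for what would be separate processors in a real deployment.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_response(replicas, event):
    """Send the event to all replicas in parallel; return the first successful answer."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    try:
        futures = [pool.submit(replica, event) for replica in replicas]
        for future in as_completed(futures):
            if future.exception() is None:   # skip replicas that failed outright
                return future.result()
        raise RuntimeError("all replicas failed")
    finally:
        pool.shutdown(wait=False)            # do not block on slower replicas


def healthy(event):
    return event.upper()


def slow(event):
    time.sleep(0.2)                          # simulates a degraded processor
    return event.upper()


def crashed(event):
    raise ConnectionError("replica down")


print(first_response([slow, crashed, healthy], "ping"))  # PING
```

Because every replica processes every event, all of them hold current state, which is why switching costs so little time and why the synchronization demands are high.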
Fault Recovery Tactics (3/4)
Passive redundancy:
- One component responds to events and informs the standbys of state updates.
- Upon failure the system must:
  - Ensure that the backup is sufficiently fresh. Restart points, checkpoints, log points ???
  - Remap the system to switch which component is the active one.
- Often used in control systems.
- Example: air traffic control (Chapter 6: “Air Traffic Control: A Case Study in Designing for High Availability”).
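A minimal sketch of passive redundancy, assuming the simplest possible policy: the primary handles each event and then copies its full state to every standby, and on failure the first live standby is promoted. The class names, the dictionary state, and the per-event checkpointing are all illustrative assumptions; real systems usually checkpoint less often and replay a log.

```python
import copy


class Replica:
    def __init__(self, name):
        self.name = name
        self.state = {}
        self.alive = True


class PassiveRedundancyGroup:
    """One active primary; standbys only receive state updates."""

    def __init__(self, primary, standbys):
        self.primary = primary
        self.standbys = list(standbys)

    def handle(self, key, value):
        if not self.primary.alive:
            self._fail_over()
        self.primary.state[key] = value
        # The primary informs the standbys of the state update (checkpoint).
        for standby in self.standbys:
            standby.state = copy.deepcopy(self.primary.state)

    def _fail_over(self):
        """Remap the group so that a live standby becomes the active component."""
        for i, standby in enumerate(self.standbys):
            if standby.alive:
                self.primary = self.standbys.pop(i)
                return
        raise RuntimeError("no live standby available")


group = PassiveRedundancyGroup(Replica("A"), [Replica("B")])
group.handle("flight-1", "cleared")
group.primary.alive = False              # the primary fails
group.handle("flight-2", "holding")      # the standby takes over with fresh state
print(group.primary.name, group.primary.state)
# B {'flight-1': 'cleared', 'flight-2': 'holding'}
```

The freshness question from the slide shows up directly: any update the primary accepted but did not checkpoint before failing would be lost at switchover.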
Fault Recovery Tactics (4/4)
Switchovers: upon failure or periodic.
Synchronization: is the responsibility of the primary component, which broadcasts synchronization signals to the redundant components.
Fault Prevention Tactics
Removal from service: to perform some preventive actions, e.g., rebooting to prevent slow memory leaks from causing problems.
Transactions: the bundling of a sequence of steps so that they are done, or undone, all at once.
Process monitor: once a fault in a process is detected, remove the process, re-instantiate it, and reinitialize its state.
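The transactions tactic can be illustrated with Python's standard `sqlite3` module, whose connection object acts as a context manager that commits when the block succeeds and rolls back when it raises. The `account` table, the `transfer` function, and the amounts are hypothetical examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one bundle: both steps happen or neither does."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
        )


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))


transfer(conn, "a", "b", 30)
try:
    transfer(conn, "a", "b", 500)   # the debit violates the CHECK constraint
except sqlite3.IntegrityError:
    pass                            # the credit to "b" has been rolled back as well
print(balances(conn))               # {'a': 70, 'b': 30}
```

Without the transaction, the failed second transfer would have left the credit to "b" in place, which is exactly the kind of inconsistent state this tactic prevents.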
Availability Tactics Hierarchy
Fault Arrives -> Availability tactics -> Fault Masked or Repaired
Fault detection: Ping/echo, Heartbeat, Exception
Recovery, preparation and repair: Voting, Active redundancy, Passive redundancy, Spare
Recovery, reintroduction: Shadow, State resynchronization, Rollback
Prevention: Removal from service, Transactions, Process monitor