Real-Time & Embedded Systems

Agenda

Safety Critical Systems

Project 6 continued

(c) Copyright 2012 Dr. Phillip A. LaPlante

Safety Critical Systems

“Safe enough” looks different at 35,000 feet.

– Bruce Powell Douglass

“The Air Force has a perfect operating record … everything we put in the air has come back down.”

– Unknown


Ubiquity of Control Systems

Electro-mechanical devices are migrating to software-driven systems

Automobiles

Planes

Home Appliances

Medical Equipment

Nuclear Power Plants


Software Failures

Therac-25

Radiation therapy device

Software-driven

Bugs allowed massive radiation overdoses

Killed 3 people, contributed to the death of a fourth


Software Failures

Patriot Missiles

Clock drift reduced their effectiveness from 95% to 13%

Allowed a SCUD missile through defense perimeter

Killed 29, injured 97

Aegis tracking system

Failure contributed to the shooting down of an Iranian airliner (Iran Air Flight 655)

290 lives lost
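
The Patriot clock drift mentioned above can be reproduced with a few lines of arithmetic. The sketch below follows the widely cited account of the incident, which is not in these slides and is treated here as an assumption: time was counted in tenths of a second and scaled by a value of 1/10 chopped to 23 binary fraction bits, and the battery had been running for about 100 hours. Python is used only for illustration; the function names are hypothetical.

```python
from fractions import Fraction

# Assumptions taken from the commonly cited account, not from the slides.
FRACTION_BITS = 23        # binary fraction bits kept when storing 1/10
HOURS_UP = 100            # continuous up-time before the incident
TICKS_PER_SECOND = 10     # the clock counted tenths of a second


def chopped_tenth(bits: int) -> Fraction:
    """Return 1/10 truncated (not rounded) to `bits` binary fraction bits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def clock_drift_seconds(hours: int, bits: int = FRACTION_BITS) -> float:
    """Accumulated timing error after `hours` of continuous operation."""
    ticks = hours * 3600 * TICKS_PER_SECOND
    error_per_tick = Fraction(1, 10) - chopped_tenth(bits)
    return float(ticks * error_per_tick)


drift = clock_drift_seconds(HOURS_UP)
print(f"drift after {HOURS_UP} h: {drift:.4f} s")  # drift after 100 h: 0.3433 s
```

A third of a second sounds small, but a target closing at more than a kilometre per second moves hundreds of metres in that time, which is enough to fall outside a tracking window.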


Software Failures

8080-based factory control software

Mistakenly stacked large boulders 80 feet high

Crushed cars and damaged a building

Robotics

Stray EM interference blamed for 19 deaths

Cardiac pacemakers

Low-energy radiation reprogrammed the devices

Caused several deaths


Software Failures

Medical Database Software

Incorrectly informed a woman that she had incurable syphilis and had passed it on to her children

She strangled one, attempted to kill another and herself

Sunlight Filtering Software

Failed to remove false missile detections based on sunlight reflecting off clouds

A Soviet Commander averted nuclear war based on a “… funny feeling in my gut.”


Terms

Reliability – the measure of up-time, or availability, of a system

The probability that a task will complete before the system fails

Measured in Mean Time Between Failures (MTBF)

Security – permitting access to systems only by authorized and authenticated persons

Safety – the system does not incur too much risk to persons or property

Risk – the chance that something bad will happen

Common-mode failure – a single failure results in the failure of multiple control paths
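
To make the reliability terms concrete, the sketch below computes two common figures from MTBF. Mean time to repair (MTTR) and the constant-failure-rate (exponential) model are assumptions introduced here for illustration; neither appears in the slides, and the function names are hypothetical.

```python
import math


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def mission_reliability(task_hours: float, mtbf_hours: float) -> float:
    """Probability that a task of the given length completes before a
    failure, assuming a constant failure rate of 1/MTBF."""
    return math.exp(-task_hours / mtbf_hours)


# Hypothetical system: fails on average every 1000 h, takes 1 h to repair.
print(availability(1000.0, 1.0))
print(mission_reliability(10.0, 1000.0))
```

The two numbers answer different questions: availability describes up-time over the long run, while mission reliability describes the chance that one particular task finishes without interruption.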


Fundamental Hazards

Release of energy

Release of toxins

Interference with life-support functions

Supplying misleading information to safety personnel or control systems

Failure to alarm when hazardous conditions arise

Failure to limit or act when unwanted events occur, inputs are flawed or outputs are outside correct levels


System Issues

Safety is a system issue

Multiple solutions may address a concern

Interlocks

Redundant hardware

Redundant software

The interaction of the components determines the safety of the system


Software Failures

Software does not fail

Failures represent a change in the capability of the system

Broken switch

Failed component

Bad sensor

If software does something wrong, it does it every time!

Software may respond poorly to failures


Single-point Failures

A device is considered safe if a single failure in the system does not result in an unsafe condition

Single-point assessment tree: (figure not reproduced)


Fail-Safe State

A condition that a safety-critical system must attain when an unrecoverable fault occurs.

Emergency Stop

Partial Shutdown

Hold

Manual Control

Restart

Driven by the problem domain needs


Fail-Safe States

An airliner jet engine fails?

Unmanned space vehicle launch?

Attended medical devices?

Hazardous area robotics?

Unmanned aircraft control failure?

Cruise ship rudder failure?


Achieving Safety

Separation of safety channels from non-safety channels

Firewall pattern

Any component failure in the channel fails the entire channel

Isolation of safety systems from non-safety systems is common and justifiable

Redundancy

Small or large scale

Homogeneous or diverse


Achieving Safety

Homogeneous

Channels are replicated verbatim

Detects only faults, not errors

Inexpensive

Diverse

A different channel is implemented

Detects faults and errors

More expensive


Achieving Safety

Diverse redundancy is stronger

Protects against systemic faults / errors

Data corruption detection

Parity bit

Hamming codes (parity bits)

Checksums

CRCs

Redundant storage
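
The detection techniques listed above differ in what they can catch. The sketch below compares a parity bit, a simple additive checksum, and a CRC on a hypothetical two-byte message. `zlib.crc32` is from the Python standard library; the helper names and sample bytes are illustrative assumptions.

```python
import zlib


def even_parity_bit(byte: int) -> int:
    """Bit that makes the total number of 1s (data plus parity) even."""
    return bin(byte & 0xFF).count("1") % 2


def additive_checksum(data: bytes) -> int:
    """Sum of all bytes, modulo 256."""
    return sum(data) % 256


# A single flipped bit changes the parity, so parity detects it.
parity_detects_flip = even_parity_bit(0b1011) != even_parity_bit(0b1010)

original = b"\x12\x34"
swapped = b"\x34\x12"     # the same two bytes, transposed

# An additive checksum cannot see reordering; a CRC can.
checksum_detects_swap = additive_checksum(original) != additive_checksum(swapped)
crc_detects_swap = zlib.crc32(original) != zlib.crc32(swapped)

print(parity_detects_flip, checksum_detects_swap, crc_detects_swap)
```

Stronger codes cost more bits and more computation, so the choice is usually driven by which corruption patterns the hazard analysis considers credible.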


Achieving Safety

Reasonableness checks

A second algorithm validating the results of the first

Usually much simpler

Feedback error detection

Identify potential fault conditions

May cause a fail-safe transition

Feedback error correction

Identify and correct potential fault conditions

Attempts to keep the system operating, and may reduce capability
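
A reasonableness check combined with feedback error detection might look like the sketch below. The sensor, its limits, and the state names are hypothetical; in practice the limits would come from the hazard analysis.

```python
from typing import Optional

# Hypothetical limits for a temperature sensor.
MIN_C, MAX_C = -40.0, 125.0
MAX_STEP_C = 5.0          # largest believable change between two samples


def is_reasonable(reading: float, previous: Optional[float]) -> bool:
    """Cheap second opinion on a value produced by the main algorithm."""
    if not (MIN_C <= reading <= MAX_C):   # also rejects NaN
        return False
    if previous is not None and abs(reading - previous) > MAX_STEP_C:
        return False
    return True


def next_state(reading: float, previous: Optional[float]) -> str:
    """Feedback error detection: an implausible value forces fail-safe."""
    return "RUN" if is_reasonable(reading, previous) else "FAIL_SAFE"


print(next_state(25.0, 24.0))    # plausible value and plausible change
print(next_state(200.0, 24.0))   # outside the physical range
print(next_state(40.0, 24.0))    # in range, but the jump is too large
```

The check is deliberately far simpler than whatever computed the reading, which makes it easier to review and less likely to share the primary algorithm's defects.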


Safety Architectures

Single-Channel Protected Design

A single flow of control

A break in the channel induces a failure

Safeguards are added to ensure correct fail-safe behavior

A single point of failure

Multi-channel Voting Pattern

An odd number of redundant channels

Each channel “votes” on the task

Majority rules

Homogeneous or diverse
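
A minimal sketch of the voter in a multi-channel voting design follows. It assumes channel outputs can be compared for exact equality, which is a simplification: diverse channels often need a tolerance band instead. The function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of channels,
    or None when no such majority exists."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


print(majority_vote([10, 10, 11]))   # two of three channels agree
print(majority_vote([1, 2, 3]))      # no majority
```

A result of None is the voter's signal that the channels disagree beyond repair, at which point the system would normally move to its fail-safe state.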


Safety Architectures

Homogeneous Redundancy Pattern

Identical channels run in parallel

If an odd number of channels:

Majority channels detect and correct minority channels

Must be fully redundant

Inexpensive to implement

Detects only faults, not errors

Recurring cost may be high due to redundant hardware


Safety Architectures

Diverse Redundancy Pattern

Redundant, but uniquely implemented channels

Different but equal

Lightweight redundancy

Separation of monitoring and actuation


Safety Architectures

Watchdog Pattern

A secondary process monitors the primary process

Primary process periodically “feeds” the secondary process

Secondary process can alarm or restart should the primary process fail

May include a periodic test suite
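
The feeding relationship in the watchdog pattern can be sketched as below. Time is passed in explicitly so the behaviour is easy to test; a real system would use a hardware timer or a monotonic clock. The class and method names are hypothetical.

```python
class Watchdog:
    """Expects to be fed at least once every `timeout_s` seconds."""

    def __init__(self, timeout_s: float, now_s: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.last_fed_s = now_s

    def feed(self, now_s: float) -> None:
        """Called by the primary process to prove it is still alive."""
        self.last_fed_s = now_s

    def expired(self, now_s: float) -> bool:
        """Called by the secondary process on its own schedule."""
        return (now_s - self.last_fed_s) > self.timeout_s


dog = Watchdog(timeout_s=0.5)
dog.feed(now_s=0.4)
healthy = not dog.expired(now_s=0.8)   # last fed 0.4 s ago
timed_out = dog.expired(now_s=1.0)     # last fed 0.6 s ago
print(healthy, timed_out)
```

When `expired` returns True, the secondary process decides what to do: raise an alarm, restart the primary, or drive the system to its fail-safe state.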


Safety Architectures

Safety Executive Pattern

A centralized coordinator for monitoring safety

A really smart watchdog

Watchdog timeouts

Software error assertions

Continuous or periodic built-in tests

Faults identified by monitors


Safety Architectures

Monitor-Actuator Pattern

Separation of algorithms

Actuation performs the actions

Monitoring tracks the actions

Additional cost and complexity
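
The separation between actuation and monitoring can be sketched as two small classes that share no logic. The tolerance, setpoint, and measured values below are hypothetical, and the monitor is assumed to read its own independent sensor.

```python
class Actuator:
    """Actuation channel: carries out the commanded action."""

    def __init__(self) -> None:
        self.commanded = 0.0

    def command(self, setpoint: float) -> None:
        self.commanded = setpoint


class Monitor:
    """Monitoring channel: checks the result using its own sensor."""

    def __init__(self, tolerance: float) -> None:
        self.tolerance = tolerance

    def agrees(self, commanded: float, measured: float) -> bool:
        return abs(commanded - measured) <= self.tolerance


actuator = Actuator()
monitor = Monitor(tolerance=2.0)
actuator.command(50.0)

tracking_ok = monitor.agrees(actuator.commanded, measured=50.5)
fault_seen = not monitor.agrees(actuator.commanded, measured=70.0)
print(tracking_ok, fault_seen)
```

Keeping the monitor ignorant of how the actuator computes its output is the point of the pattern: a defect in one channel should not silently disable the other.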


Eight Steps to Safety

Identify the hazards

Determine the risks

Define the safety measures

Create safe requirements

Create safe designs

Implement safety

Assure the safety process

Test, test, test (Peer Reviews!)


Identify the Hazards

Identify the hazard

Determine the level of risk

Determine the tolerance time

Determine the source of the hazard:

The fault leading to the hazard

The likelihood of the fault

The fault detection time

The means by which the hazard is handled:

The means

The fault reaction (exposure time)
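
The timing quantities above are related: a hazard is only handled if the fault is detected and reacted to before the tolerance time runs out. The sketch below encodes that check; the function name and the numbers are hypothetical.

```python
def hazard_is_covered(tolerance_s: float,
                      detection_s: float,
                      reaction_s: float) -> bool:
    """True when detection plus reaction fits inside the tolerance time."""
    return detection_s + reaction_s < tolerance_s


# Hypothetical hazard that can be tolerated for 5 seconds.
fast_enough = hazard_is_covered(tolerance_s=5.0, detection_s=1.5, reaction_s=2.0)
too_slow = hazard_is_covered(tolerance_s=5.0, detection_s=4.0, reaction_s=2.0)
print(fast_enough, too_slow)
```

Running this check for every row of the hazard analysis is a quick way to find safety measures that exist on paper but act too late to help.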


Identify the Hazards

Patient Ventilator Example: (slide graphic not reproduced)


Fault Analysis

Fault-tree analysis (FTA)

Identify the hazards

Work backward from the hazard to identify the causal conditions

Diagram with a boolean flow chart

UML Activity diagram

Failure mode and effects analysis (FMEA)

Identify potential faults

Work forward to the consequences
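
A fault tree is a boolean expression with the hazard at the top, so a small one can be written directly as code. The tree below is entirely hypothetical: an overpressure hazard occurs if a valve sticks and the fault goes unnoticed, where "unnoticed" means the sensor or the alarm has failed.

```python
from itertools import product


def overpressure_hazard(valve_stuck: bool,
                        sensor_failed: bool,
                        alarm_failed: bool) -> bool:
    """Top event of a small, hypothetical fault tree."""
    unnoticed = sensor_failed or alarm_failed    # OR gate
    return valve_stuck and unnoticed             # AND gate


# Working backward: list every combination of basic events that
# produces the hazard.
causes = [combo for combo in product([False, True], repeat=3)
          if overpressure_hazard(*combo)]
print(len(causes))
```

Because the top gate is an AND, no single basic event causes the hazard on its own, which is the property a single-point failure assessment looks for.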


Determine the Risks

FDA levels of concern

Minor – not expected to result in injury or death

Moderate – results in minor to moderate injury

Major – results in major injury or death

German TÜV characterization

(S) Severity of the risk

(E) Duration of the period of exposure

(G) Prevention of the danger

(W) Probability of occurrence


Determine the Risks

German TÜV characterization (figure not reproduced)


Determine the Risks

German TÜV Example (figure not reproduced)


Define the Safety Measures

Obviation – make the hazard physically impossible

Education – User training

Alarming – announce the hazard so action can be taken

Interlocks – the hazard is removed by a secondary device or logic that intercedes

Internal Checking – the system detects and handles the malfunction prior to an incident

Safety Equipment – goggles, gloves, etc.

Restriction of access – access to potential hazards is restricted to trained personnel

Labeling – e.g., “High Voltage, do not touch”


Create Safe Requirements

Consider the requirements from a safety perspective

Specify the negations

The system shall not move hardware before user input
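
A negated requirement such as the one above translates naturally into a guard in the code and a test that tries to violate it. The controller below is a hypothetical sketch, not a real motion API.

```python
class MotionController:
    """Refuses to move hardware until the user has given input."""

    def __init__(self) -> None:
        self.user_confirmed = False
        self.position = 0

    def confirm(self) -> None:
        """Record that the user has provided the required input."""
        self.user_confirmed = True

    def move_to(self, target: int) -> bool:
        """Move only after user input; return whether motion happened."""
        if not self.user_confirmed:
            return False
        self.position = target
        return True


controller = MotionController()
moved_before_input = controller.move_to(10)   # must be refused
controller.confirm()
moved_after_input = controller.move_to(10)
print(moved_before_input, moved_after_input, controller.position)
```

Stating the requirement as "shall not" gives the tester something specific to attack: the first call exists only to confirm that the forbidden behaviour really is impossible.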


Create Safe Designs

Work from safe requirements

Adopt a safe architecture

Revisit and revise the hazard analysis during development

Select measures that provide appropriate levels of detection and correction

Ensure independent channels lack common-mode failures

Adopt consistent strategies for handling faults

Include POST and periodic run-time tests


Implementing Safety

Language Choice

Strong compile-time checking

Strong run-time checking

Support for encapsulation and abstraction (but not “just because”)

Exception handling

“Safe” language constructs

void*?


Assure the Safety Process

Continuously track against hazard analysis

Utilize peer reviews to assure quality

Verify design adherence

Verify coding standards

Identify how each hazard is handled


Test, test, test

Black box testing

White box testing

Monkey testing

Fault seeding

Load testing

Simulations

System testing

Unit testing

