
Real-Time & Embedded Systems

Agenda

Safety Critical Systems

Project 6 continued

(c) Copyright 2012 Dr. Phillip A. LaPlante


Safety Critical Systems

“Safe enough” looks different at 35,000 feet.

– Bruce Powell Douglass

“The Air Force has a perfect operating record … everything we put in the air has come back down.”

- Unknown


Ubiquity of Control Systems

Electro-mechanical devices are migrating to software-driven systems

Automobiles

Planes

Home Appliances

Medical Equipment

Nuclear Power Plants


Software Failures

Therac-25

Radiation therapy device

Software-driven

Bugs allowed massive radiation overdoses

Killed 3 people, contributed to the death of a fourth


Software Failures

Patriot Missiles

Clock drift reduced their effectiveness from 95% to 13%

Allowed a SCUD missile through defense perimeter

Killed 29, injured 97

Aegis tracking system

Failure contributed to the shooting down of an Iranian airliner (Iran Air Flight 655)

290 lives lost


Software Failures

8080-based factory control software

Mistakenly stacked large boulders 80 feet high

Crushed cars and damaged a building

Robotics

Stray EM interference blamed for 19 deaths

Cardiac pacemakers

Low-energy radiation reprogrammed the devices

Caused several deaths


Software Failures

Medical Database Software

Incorrectly informed woman she had incurable syphilis and had passed it on to her children

She strangled one, attempted to kill another and herself

Sunlight Filtering Software

Failed to remove false missile detections based on sunlight reflecting off clouds

A Soviet Commander averted nuclear war based on a “… funny feeling in my gut.”


Terms

Reliability – the measure of up-time, or availability, of a system

The probability that a task will complete before the system fails

Measured in Mean Time Between Failures (MTBF)

Security – permitting system access only to authorized and authenticated persons

Safety – the system does not incur too much risk to persons or property

Risk – the chance that something bad will happen

Common-mode failure – a single failure results in the failure of multiple control paths
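As a rough illustration of how MTBF connects to the two readings of reliability above, the standard textbook relations can be written as follows. They assume a constant failure rate, which the slide does not state, and MTTR (mean time to repair) is a symbol introduced here for the illustration.

```latex
% Probability that the system runs for time t without failing
% (assumes a constant failure rate of 1/MTBF)
R(t) = e^{-t/\mathrm{MTBF}}

% Steady-state availability (fraction of time the system is up);
% MTTR is an assumed extra parameter, not from the slide
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
```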


Fundamental Hazards

Release of energy

Release of toxins

Interference with life-support functions

Supplying misleading information to safety personnel or control systems

Failure to alarm when hazardous conditions arise

Failure to limit or act when unwanted events occur, inputs are flawed or outputs are outside correct levels


System Issues

Safety is a system issue

Multiple solutions may address a concern

Interlocks

Redundant hardware

Redundant software

The interaction of the components determines the safety of the system


Software Failures

Software does not fail

Failures represent a change in the capability of the system

Broken switch

Failed component

Bad sensor

If software does something wrong, it does it every time!

Software may respond poorly to failures


Single-point Failures

A device is considered safe if a single failure in the system does not result in an unsafe condition

Single-point assessment tree: [figure not preserved]


Fail-Safe State

A condition a safety-critical system must attain when it encounters an unrecoverable fault.

Emergency Stop

Partial Shutdown

Hold

Manual Control

Restart

Driven by the problem domain needs


Fail-Safe States

An airliner jet engine fails?

Unmanned space vehicle launch?

Attended medical devices?

Hazardous area robotics?

Unmanned aircraft control failure?

Cruise ship rudder failure?


Achieving Safety

Separation of safety channels from non-safety channels

Firewall pattern

Any component failure in the channel fails the entire channel

Isolation of safety systems from non-safety systems is common and justifiable

Redundancy

Small or large scale

Homogeneous or diverse


Achieving Safety

Homogeneous

Channels are replicated verbatim

Detects only faults, not errors

Inexpensive

Diverse

A different channel is implemented

Detects faults and errors

More expensive


Achieving Safety

Diverse redundancy is stronger

Protects against systemic faults / errors

Data corruption detection

Parity bit

Hamming codes (parity bits)

Checksums

CRCs

Redundant storage
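A minimal Python sketch of three of the detection techniques listed above: a parity bit, a simple checksum, and a CRC. The function names and the sample message are hypothetical; `zlib.crc32` is the standard-library CRC-32, used here only as a convenient example of a CRC.

```python
import zlib


def even_parity_bit(data: bytes) -> int:
    """Return the extra bit that makes the total count of 1 bits even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


def additive_checksum(data: bytes) -> int:
    """8-bit additive checksum: the sum of all bytes modulo 256."""
    return sum(data) % 256


def crc32(data: bytes) -> int:
    """CRC-32 of the data, taken from the standard library."""
    return zlib.crc32(data) & 0xFFFFFFFF


# Hypothetical stored message and two corrupted copies of it.
original = b"setpoint=42"
bit_flipped = bytes([original[0] ^ 0x01]) + original[1:]  # one bit changed
swapped = b"setpoint=24"                                  # two bytes swapped
```

A single flipped bit changes the parity bit, but swapping two bytes leaves the additive checksum unchanged, while the CRC still differs. That is one reason the stronger codes are worth their extra cost.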


Achieving Safety

Reasonableness checks

A second algorithm validating the results of the first

Usually much simpler

Feedback error detection

Identify potential fault conditions

May cause a fail-safe transition

Feedback error correction

Identify and correct potential fault conditions

Attempts to keep the system operating, and may reduce capability
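The three ideas above can be sketched together in Python. This is an illustration under assumptions, not a prescribed design: the `Limits` type, the range and step rules, and the state names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    minimum: float
    maximum: float
    max_step: float  # largest plausible change between consecutive outputs


def is_reasonable(value: float, previous: float, limits: Limits) -> bool:
    """Simple second algorithm that validates the primary algorithm's output."""
    in_range = limits.minimum <= value <= limits.maximum
    plausible_step = abs(value - previous) <= limits.max_step
    return in_range and plausible_step


def detect(value: float, previous: float, limits: Limits) -> str:
    """Feedback error detection: an implausible value requests the fail-safe state."""
    return "OPERATE" if is_reasonable(value, previous, limits) else "FAIL_SAFE"


def correct(value: float, previous: float, limits: Limits) -> float:
    """Feedback error correction: clamp the value into the plausible envelope."""
    low = max(limits.minimum, previous - limits.max_step)
    high = min(limits.maximum, previous + limits.max_step)
    return min(max(value, low), high)
```

With `Limits(0.0, 100.0, 10.0)` and a previous output of 50.0, a computed value of 150.0 is rejected by `detect` and clamped to 60.0 by `correct`. The system keeps operating, but with reduced capability.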


Safety Architectures

Single-Channel Protected Design

A single flow of control

A break in the channel induces a failure

Safeguards are added to ensure correct fail-safe behavior

A single point of failure

Multi-channel Voting Pattern

An odd number of redundant channels

Each channel “votes” on the task

Majority rules

Homogeneous or diverse
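A minimal voter for the multi-channel voting pattern might look like the following Python sketch. The function name and the choice to return `None` when no strict majority exists are assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(votes: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value chosen by a strict majority of channels, else None."""
    if len(votes) % 2 == 0:
        # Also rejects an empty list, whose length is zero.
        raise ValueError("voting requires an odd number of channels")
    value, count = Counter(votes).most_common(1)[0]
    return value if count > len(votes) // 2 else None
```

With three channels, `majority_vote(["open", "closed", "closed"])` selects `"closed"`, outvoting the single disagreeing channel. If all three channels disagree, the result is `None` and the caller would have to fall back to its fail-safe behavior.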


Safety Architectures

Homogeneous Redundancy Pattern

Identical channels run in parallel

If an odd number of channels:

Majority channels detect and correct minority channels

Must be fully redundant

Inexpensive to implement (a single design is replicated)

Detects only faults, not errors

May be expensive to build due to redundant hardware


Safety Architectures

Diverse Redundancy Pattern

Redundant, but uniquely implemented channels

Different but equal

Lightweight redundancy

Separation of monitoring and actuation


Safety Architectures

Watchdog Pattern

A secondary process monitors the primary process

Primary process periodically “feeds” the secondary process

Secondary process can alarm or restart should the primary process fail

May include a periodic test suite
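The feed-and-check relationship in the watchdog pattern can be sketched in Python as below. This is a software-only illustration under assumptions: the `Watchdog` class, its method names, and the injectable `clock` parameter are hypothetical, and a real embedded watchdog is often a hardware timer.

```python
import time
from typing import Callable


class Watchdog:
    """Secondary process that expects the primary to feed it periodically."""

    def __init__(
        self,
        timeout_s: float,
        on_timeout: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout_s = timeout_s
        self._on_timeout = on_timeout  # e.g. raise an alarm or restart
        self._clock = clock
        self._last_fed = clock()

    def feed(self) -> None:
        """Called by the primary process to show it is still alive."""
        self._last_fed = self._clock()

    def check(self) -> bool:
        """Called periodically by the secondary; True if the primary is healthy."""
        if self._clock() - self._last_fed > self._timeout_s:
            self._on_timeout()
            return False
        return True
```

Passing the clock in as a parameter is a design choice that lets the timeout behavior be tested without waiting in real time.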


Safety Architectures

Safety Executive Pattern

A centralized coordinator for monitoring safety

A really smart watchdog

Watchdog timeouts

Software error assertions

Continuous or periodic built-in tests

Faults identified by monitors


Safety Architectures

Monitor-Actuator Pattern

Separation of algorithms

Actuation performs the actions

Monitoring tracks the actions

Additional cost and complexity
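The separation described above can be sketched as two small, independent classes. This is a hypothetical Python illustration: the class names, the tolerance rule, and the idea of an independently measured value are assumptions, not details from the slide.

```python
from dataclasses import dataclass


@dataclass
class ActuationChannel:
    """Performs the action: records and issues the commanded setpoint."""

    commanded: float = 0.0

    def actuate(self, setpoint: float) -> float:
        self.commanded = setpoint
        return self.commanded


@dataclass
class MonitoringChannel:
    """Tracks the action: compares an independent measurement to the command."""

    tolerance: float

    def within_tolerance(self, commanded: float, measured: float) -> bool:
        return abs(commanded - measured) <= self.tolerance
```

The monitor shares no state with the actuation channel. It sees only the commanded value and a separately measured one, which is what allows it to catch a fault in the actuation path.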


Eight Steps to Safety

Identify the hazards

Determine the risks

Define the safety measures

Create safe requirements

Create safe designs

Implement safety

Assure the safety process

Test, test, test (Peer Reviews!)


Identify the Hazards

Identify the hazard

Determine the level of risk

Determine the tolerance time

Determine the source of the hazard:

The fault leading to the hazard

The likelihood of the fault

The fault detection time

The means by which the hazard is handled:

The means

The fault reaction (exposure time)
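One way to use the items above is to record them per hazard and check the timing budget. The Python sketch below assumes the usual rule that the fault must be detected and handled within the tolerance time. The field names and the `time_to_safe_ms` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HazardEntry:
    hazard: str
    risk_level: str
    tolerance_time_ms: float   # how long the hazard can persist safely
    fault: str
    likelihood: str
    detection_time_ms: float   # time to notice the fault
    control_measure: str       # the means by which the hazard is handled
    reaction_time_ms: float    # time to reach a safe state once detected

    @property
    def time_to_safe_ms(self) -> float:
        """Assumed budget: detection time plus reaction time."""
        return self.detection_time_ms + self.reaction_time_ms

    def is_handled_in_time(self) -> bool:
        return self.time_to_safe_ms <= self.tolerance_time_ms
```

For example, an entry with a 500 ms tolerance time, 100 ms detection time, and 300 ms reaction time (illustrative numbers only) passes the check. Raising the detection time to 300 ms would exceed the budget.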


Identify the Hazards

Patient Ventilator Example: [table not preserved]


Fault Analysis

Fault-tree analysis (FTA)

Identify the hazards

Work backward from the hazard to identify the causal conditions

Diagram with a Boolean flow chart

UML Activity diagram

Failure mode and effects analysis (FMEA)

Identify potential faults

Work forward to the consequences
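The Boolean structure of a fault tree can be sketched in Python as nested AND and OR gates over basic events. The node types, the `occurs` function, and the overheating example are all hypothetical, chosen only to show the backward reasoning from a hazard to its causal conditions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Event:
    name: str  # a basic fault, looked up by name


@dataclass(frozen=True)
class And:
    children: Tuple["Node", ...]  # hazard requires all child conditions


@dataclass(frozen=True)
class Or:
    children: Tuple["Node", ...]  # hazard requires any child condition


Node = Union[Event, And, Or]


def occurs(node: Node, facts: Dict[str, bool]) -> bool:
    """Evaluate whether the condition at this node holds for the given faults."""
    if isinstance(node, Event):
        return facts.get(node.name, False)
    if isinstance(node, And):
        return all(occurs(child, facts) for child in node.children)
    return any(occurs(child, facts) for child in node.children)


# Hypothetical hazard: overheating needs the heater stuck on AND
# at least one protective element (sensor or alarm) to have failed.
OVERHEAT = And((
    Event("heater_stuck_on"),
    Or((Event("sensor_failed"), Event("alarm_failed"))),
))
```

In this example a stuck heater alone does not produce the hazard, because both protective elements are still working. The tree makes that single-point-failure argument explicit.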


Determine the Risks

FDA levels of concern

Minor – not expected to result in injury or death

Moderate – results in minor to moderate injury

Major – results in major injury or death

German TÜV characterization

(S) Severity of the risk

(E) Duration of the period of exposure

(G) Prevention of the danger

(W) Probability of occurrence


Determine the Risks

German TÜV characterization [figure not preserved]


Determine the Risks

German TÜV Example [figure not preserved]


Define the Safety Measures

Obviation – make the hazard physically impossible

Education – User training

Alarming – Announce the hazard so action can be taken

Interlocks – the hazard is removed by a secondary device or logic that intercedes

Internal Checking – the system detects and handles the malfunction prior to an incident

Safety Equipment – goggles, gloves, etc.

Restriction of access – access to potential hazards is restricted to trained personnel

Labeling – High Voltage, do not touch


Create Safe Requirements

Consider the requirements from a safety perspective

Specify the negations

The system shall not move hardware before user input
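A negated requirement like the one above maps naturally onto a guard in code. The Python sketch below is one possible reading, with a hypothetical `MotionController` that refuses to move until user input has been registered.

```python
class MotionInterlockError(RuntimeError):
    """Raised when motion is requested before the user has provided input."""


class MotionController:
    def __init__(self) -> None:
        self._user_input_received = False
        self.position = 0

    def register_user_input(self) -> None:
        """Record that the user has explicitly authorized motion."""
        self._user_input_received = True

    def move(self, steps: int) -> int:
        # Enforces the negation: no hardware motion before user input.
        if not self._user_input_received:
            raise MotionInterlockError("motion requested before user input")
        self.position += steps
        return self.position
```

Stating the requirement as a negation also gives a direct test case: calling `move` on a fresh controller must fail and leave the position unchanged.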


Create Safe Designs

Work from safe requirements

Adopt a safe architecture

Revisit, revise the hazard analysis during development

Select measures that provide appropriate levels of detection and correction

Ensure independent channels lack common-mode failures

Adopt consistent strategies for handling faults

Include power-on self-test (POST) and periodic run-time tests


Implementing Safety

Language Choice

Strong compile-time checking

Strong run-time checking

Support for encapsulation and abstraction (but not “just because”)

Exception handling

“Safe” language constructs

void*?


Assure the Safety Process

Continuously track against hazard analysis

Utilize peer reviews to assure quality

Verify design adherence

Verify coding standards

Identify how each hazard is handled


Test, test, test

Black box testing

White box testing

Monkey testing

Fault seeding

Load testing

Simulations

System testing

Unit testing
