Real-Time & Embedded Systems
Agenda
Safety Critical Systems
Project 6 continued
(c) Copyright 2012 Dr. Phillip A. LaPlante
Safety Critical Systems
“Safe enough” looks different at 35,000 feet.
– Bruce Powell Douglass
“The Air Force has a perfect operating record … everything we put in the air has come back down.”
- Unknown
Ubiquity of Control Systems
Electro-mechanical devices are migrating to software-driven systems
Automobiles
Planes
Home Appliances
Medical Equipment
Nuclear Power Plants
Software Failures
Therac-25
Radiation therapy device
Software-driven
Bugs allowed massive radiation overdoses
Killed 3 people, contributed to the death of a fourth
Software Failures
Patriot Missiles
Clock drift reduced their effectiveness from 95% to 13%
Allowed a SCUD missile through defense perimeter
Killed 29, injured 97
Aegis tracking system
Failure contributed to the shooting down of an Iranian airliner
290 lives lost
Software Failures
8080-based factory control software
Mistakenly stacked large boulders 80 feet high
Crushed cars and damaged a building
Robotics
Stray EM interference blamed for 19 deaths
Cardiac pacemakers
Reprogrammed by low-energy radiation
Caused several deaths
Software Failures
Medical Database Software
Incorrectly informed woman she had incurable syphilis and had passed it on to her children
She strangled one, attempted to kill another and herself
Sunlight Filtering Software
Failed to remove false missile detections based on sunlight reflecting off clouds
A Soviet Commander averted nuclear war based on a “… funny feeling in my gut.”
Terms
Reliability – the measure of up-time, or availability, of a system
The probability that a task will complete before the system fails
Measured in Mean Time Between Failures (MTBF)
Security – permitting access to systems only by authorized and authenticated persons
Safety – the system does not incur too much risk to persons or property
Risk – the chance that something bad will happen
Common-mode failure – a single failure results in the failure of multiple control paths
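The two senses of reliability listed above (availability, and the probability that a task completes before a failure) can be related to MTBF numerically. The following is a minimal sketch, not a definitive model: it assumes a constant failure rate (exponential model) and introduces a mean time to repair (MTTR), neither of which appears in the slides, and the function names are hypothetical.

```python
import math

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up.
    MTTR (mean time to repair) is an assumption added for this sketch."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def task_completion_probability(task_hours: float, mtbf_hours: float) -> float:
    """Probability that a task of the given length completes before a failure,
    assuming a constant failure rate of 1/MTBF (exponential model)."""
    return math.exp(-task_hours / mtbf_hours)

# Example use with made-up figures: a 10-hour task on a system
# with a 1000-hour MTBF and a 2-hour MTTR.
example_availability = availability(1000.0, 2.0)
example_completion = task_completion_probability(10.0, 1000.0)
```

Under these assumptions a long MTBF alone does not guarantee high availability; the repair time matters as well.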
Fundamental Hazards
Release of energy
Release of toxins
Interference with life-support functions
Supplying misleading information to safety personnel or control systems
Failure to alarm when hazardous conditions arise
Failure to limit or act when unwanted events occur, inputs are flawed or outputs are outside correct levels
System Issues
Safety is a system issue
Multiple solutions may address a concern
Interlocks
Redundant hardware
Redundant software
The interaction of the components determines the safety of the system
Software Failures
Software does not fail
Failures represent a change in the capability of the system
Broken switch
Failed component
Bad sensor
If software does something wrong, it does it every time!
Software may respond poorly to failures
Single-point Failures
A device is considered safe if a single failure in the system does not result in an unsafe condition
Single-point assessments tree:
Fail-Safe State
A condition a safety-critical system must attain when an unrecoverable fault occurs.
Emergency Stop
Partial Shutdown
Hold
Manual Control
Restart
Driven by the problem domain needs
Fail-Safe States
An airliner jet engine fails?
Unmanned space vehicle launch?
Attended medical devices?
Hazardous area robotics?
Unmanned aircraft control failure?
Cruise ship rudder failure?
Achieving Safety
Separation of safety channels from non-safety channels
Firewall pattern
Any component failure in the channel fails the entire channel
Isolation of safety systems from non-safety systems is common and justifiable
Redundancy
Small or large scale
Homogeneous or diverse
Achieving Safety
Homogeneous
Channels are replicated verbatim
Detects only faults, not errors
Inexpensive
Diverse
A different channel is implemented
Detects faults and errors
More expensive
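The difference between the two kinds of redundancy can be illustrated with a small sketch. It reads "errors" as systematic design errors (an interpretation, not something the slides define), and the two temperature-conversion channels and the seeded mistake are hypothetical. Replicating a channel verbatim replicates its design error, so the copies agree with each other; an independently written channel disagrees and exposes the problem.

```python
def channel_a(celsius: float) -> float:
    """Primary channel: Celsius to Fahrenheit with a deliberately seeded
    design error (1.5 instead of 1.8), standing in for a systematic error."""
    return celsius * 1.5 + 32.0

def channel_b(celsius: float) -> float:
    """Diverse channel: an independently written conversion via Kelvin."""
    kelvin = celsius + 273.15
    return kelvin * 9.0 / 5.0 - 459.67

def channels_agree(first, second, celsius: float, tolerance: float = 0.01) -> bool:
    """Compare the outputs of two channels for the same input."""
    return abs(first(celsius) - second(celsius)) <= tolerance

# Homogeneous redundancy: the same code twice, so the error goes unnoticed.
homogeneous_agree = channels_agree(channel_a, channel_a, 100.0)

# Diverse redundancy: the independent implementation disagrees.
diverse_agree = channels_agree(channel_a, channel_b, 100.0)
```

Note that a disagreement only shows that one channel is wrong, not which one; deciding that needs a third channel or a separate check.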
Achieving Safety
Diverse redundancy is stronger
Protects against systemic faults / errors
Data corruption detection
Parity bit
Hamming codes (parity bits)
Checksums
CRCs
Redundant storage
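Three of the detection techniques listed above differ in what they catch. The sketch below is illustrative only: the message contents and helper names are made up, the additive checksum is one simple variant among many, and the CRC uses Python's standard `zlib.crc32`.

```python
import zlib

def even_parity_bit(data: bytes) -> int:
    """Return the bit that makes the total count of 1 bits even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2

def additive_checksum(data: bytes) -> int:
    """A very simple checksum: the sum of all bytes modulo 256."""
    return sum(data) % 256

message = b"valve=open"
stored_parity = even_parity_bit(message)
stored_checksum = additive_checksum(message)
stored_crc = zlib.crc32(message)

# Three hypothetical corruptions of the stored message.
single_flip = bytes([message[0] ^ 0x01]) + message[1:]      # one bit flipped
double_flip = bytes([message[0] ^ 0x03]) + message[1:]      # two bits flipped
swapped = bytes([message[1], message[0]]) + message[2:]     # two bytes transposed

parity_catches_single = even_parity_bit(single_flip) != stored_parity
parity_catches_double = even_parity_bit(double_flip) != stored_parity
checksum_catches_swap = additive_checksum(swapped) != stored_checksum
crc_catches_swap = zlib.crc32(swapped) != stored_crc
```

In this sketch a single parity bit misses the two-bit error and the additive checksum misses the transposition, while the CRC detects both; that extra coverage is the usual reason for paying its higher computational cost.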
Achieving Safety
Reasonableness checks
A second algorithm validating the results of the first
Usually much simpler
Feedback error detection
Identify potential fault conditions
May cause a fail-safe transition
Feedback error correction
Identify and correct potential fault conditions
Attempts to keep the system operating, and may reduce capability
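A reasonableness check and the two feedback strategies above can be sketched together. This is a minimal illustration under stated assumptions: the square-root computation, the status strings, the "last good value" fallback, and the seeded faulty routine are all hypothetical choices, not taken from the slides.

```python
def primary_sqrt(x: float) -> float:
    """First algorithm: Newton's method for the square root of x >= 0."""
    guess = max(x, 1.0)
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess

def is_reasonable(x: float, result: float, rel_tol: float = 1e-9) -> bool:
    """Second, much simpler algorithm: squaring the result must give x back."""
    return result >= 0.0 and abs(result * result - x) <= rel_tol * max(x, 1.0)

def detect_only(x: float, compute=primary_sqrt):
    """Feedback error detection: flag the fault and request the fail-safe state."""
    result = compute(x)
    if is_reasonable(x, result):
        return ("ok", result)
    return ("fail_safe", None)

def detect_and_correct(x: float, last_good: float, compute=primary_sqrt):
    """Feedback error correction: keep operating on the last good value,
    at reduced capability, instead of stopping."""
    result = compute(x)
    if is_reasonable(x, result):
        return ("ok", result)
    return ("degraded", last_good)

def faulty_sqrt(x: float) -> float:
    """A seeded fault used only to exercise the checks."""
    return x / 2.0
```

For example, `detect_only(16.0, faulty_sqrt)` requests the fail-safe state, while `detect_and_correct(16.0, 3.9, faulty_sqrt)` keeps running on the stale value 3.9.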
Safety Architectures
Single-Channel Protected Design
A single flow of control
A break in the channel induces a failure
Safeguards are added to ensure correct fail-safe behavior
A single point of failure
Multi-channel Voting Pattern
An odd number of redundant channels
Each channel “votes” on the task
Majority rules
Homogeneous or diverse
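The voting step can be sketched as a strict-majority function over the channel outputs. The function names and the choice to return `None` when no majority exists are assumptions of this sketch; a real system would map that case to its fail-safe state.

```python
from collections import Counter

def majority_vote(votes):
    """Return the value reported by a strict majority of channels,
    or None when no value has a majority."""
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if count > len(votes) / 2 else None

def dissenting_channels(votes):
    """Return the indices of channels that disagree with the majority."""
    winner = majority_vote(votes)
    if winner is None:
        return list(range(len(votes)))
    return [index for index, vote in enumerate(votes) if vote != winner]
```

With three channels reporting `["open", "open", "closed"]`, the vote is `"open"` and channel index 2 is flagged as the dissenter. An odd channel count avoids ties between two competing values.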
Safety Architectures
Homogeneous Redundancy Pattern
Identical channels run in parallel
If an odd number of channels:
Majority channels detect and correct minority channels
Must be fully redundant
Inexpensive to develop, since one design is replicated
Detects only faults, not errors
May be expensive to build due to redundant hardware
Safety Architectures
Diverse Redundancy Pattern
Redundant, but uniquely implemented channels
Different but equal
Lightweight redundancy
Separation of monitoring and actuation
Safety Architectures
Watchdog Pattern
A secondary process monitors the primary process
Primary process periodically “feeds” the secondary process
Secondary process can alarm or restart should the primary process fail
May include a periodic test suite
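The feed-and-timeout behaviour described above can be sketched as a small class. This is an illustration only: the class and method names are hypothetical, the clock is injectable so the logic can be exercised without waiting, and in an embedded target the watchdog would normally be a hardware timer or a separate task rather than a Python object.

```python
import time

class Watchdog:
    """Secondary-process state: tracks when the primary last checked in."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock            # injectable clock (assumption, for testing)
        self._last_fed = clock()

    def feed(self):
        """Called periodically by the primary process while it is healthy."""
        self._last_fed = self._clock()

    def expired(self):
        """True when the primary has not fed the watchdog within the timeout."""
        return self._clock() - self._last_fed > self.timeout_s

    def poll(self, on_failure):
        """Run the alarm or restart action if the primary appears to have failed."""
        if self.expired():
            on_failure()
            self.feed()                # start a fresh timing window after recovery
            return True
        return False
```

The secondary process would call `poll` on its own schedule, passing whatever alarm or restart action the system requires.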
Safety Architectures
Safety Executive Pattern
A centralized coordinator for monitoring safety
A really smart watchdog
Watchdog timeouts
Software error assertions
Continuous or periodic built-in tests
Faults identified by monitors
Safety Architectures
Monitor-Actuator Pattern
Separation of algorithms
Actuation performs the actions
Monitoring tracks the actions
Additional cost and complexity
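The separation of the two algorithms can be sketched as two independent functions that share only the setpoint. The proportional control law, the gain, the tolerance, and the use of a second, independent sensor reading are all assumptions made for this illustration.

```python
def actuation_channel(setpoint: float, measured: float, gain: float = 0.5) -> float:
    """Actuation: compute the drive output that moves the system toward
    the setpoint (a simple proportional law, chosen for illustration)."""
    return gain * (setpoint - measured)

def monitoring_channel(setpoint: float, independent_measurement: float,
                       tolerance: float) -> bool:
    """Monitoring: check, using an independent measurement, that the result
    of actuation tracks the setpoint. Returns False when it does not."""
    return abs(setpoint - independent_measurement) <= tolerance
```

Because the monitor does not reuse the actuation code or its sensor reading, a fault in the actuation channel does not silently disable the check. When `monitoring_channel` returns False, the system would move to its fail-safe state.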
Eight Steps to Safety
Identify the hazards
Determine the risks
Define the safety measures
Create safe requirements
Create safe designs
Implement safety
Assure the safety process
Test, test, test (Peer Reviews!)
Identify the Hazards
Identify the hazard
Determine the level of risk
Determine the tolerance time
Determine the source of the hazard:
The fault leading to the hazard
The likelihood of the fault
The fault detection time
The means by which the hazard is handled:
The means
The fault reaction (exposure time)
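The timing quantities listed above lend themselves to a simple budget check. The sketch below assumes that a hazard is adequately handled when the fault detection time plus the fault reaction time does not exceed the tolerance time; that rule, the record layout, and all numbers are illustrative assumptions rather than values from the slides.

```python
from dataclasses import dataclass

@dataclass
class HazardTiming:
    name: str
    tolerance_time_ms: float   # how long the hazard can persist before harm
    detection_time_ms: float   # time to detect the fault
    reaction_time_ms: float    # time to reach a safe state once detected

    def time_to_safe_ms(self) -> float:
        """Total time the hazard persists under this sketch's assumption."""
        return self.detection_time_ms + self.reaction_time_ms

    def is_covered(self) -> bool:
        """True when the fault is handled within the tolerance time."""
        return self.time_to_safe_ms() <= self.tolerance_time_ms

# Hypothetical entries with made-up figures.
covered = HazardTiming("hypothetical hazard A", 300_000.0, 30_000.0, 5_000.0)
not_covered = HazardTiming("hypothetical hazard B", 300_000.0, 400_000.0, 5_000.0)
```

An entry that fails the check points to a safety measure that detects or reacts too slowly for the hazard it is meant to handle.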
Identify the Hazards
Patient Ventilator Example:
Fault Analysis
Fault-tree analysis (FTA)
Identify the hazards
Work backward from the hazard to identify the causal conditions
Diagram with a boolean flow chart
UML Activity diagram
Failure mode effect analysis (FMEA)
Identify potential faults
Work forward to the consequences
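Both directions of analysis can be sketched with boolean gates. The fault tree below is entirely hypothetical (the event names and the tree shape are made up for illustration): the hazard expression is written by working backward from the hazard, as in FTA, and the loop over single faults works forward from each fault to its consequence, in the spirit of FMEA.

```python
def and_gate(*inputs: bool) -> bool:
    """Output is true only if every input event is true."""
    return all(inputs)

def or_gate(*inputs: bool) -> bool:
    """Output is true if any input event is true."""
    return any(inputs)

# Hypothetical basic events for an illustrative device.
BASIC_EVENTS = ("pump_stuck", "sensor_failed", "alarm_failed")

def hazard(events: dict) -> bool:
    """Hypothetical tree: the hazard occurs when the pump is stuck AND
    the condition goes unnoticed (sensor failed OR alarm failed)."""
    return and_gate(events["pump_stuck"],
                    or_gate(events["sensor_failed"], events["alarm_failed"]))

def single_point_failures() -> list:
    """Work forward from each fault in isolation and report those that
    cause the hazard on their own."""
    found = []
    for name in BASIC_EVENTS:
        events = {other: (other == name) for other in BASIC_EVENTS}
        if hazard(events):
            found.append(name)
    return found
```

For this tree no single fault produces the hazard, so `single_point_failures()` returns an empty list; replacing the AND gate with an OR gate would make every basic event a single-point failure.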
Determine the Risks
FDA levels of concern
Minor – not expected to result in injury or death
Moderate – results in minor to moderate injury
Major – results in major injury or death
German TUV characterization
(S) Severity of the risk
(E) Duration of the period of exposure
(G) Prevention of the danger
(W) Probability of occurrence
Determine the Risks
German TUV characterization
Determine the Risks
German TUV Example
Define the Safety Measures
Obviation – make the hazard physically impossible
Education – user training
Alarming – announce the hazard so action can be taken
Interlocks – the hazard is removed by a secondary device or logic that intercedes
Internal Checking – the system detects and handles the malfunction prior to an incident
Safety Equipment – goggles, gloves, etc
Restriction of access – access to potential hazards is restricted to trained personnel
Labeling – High Voltage, do not touch
Create Safe Requirements
Consider the requirements from a safety perspective
Specify the negations
The system shall not move hardware before user input
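A negated requirement such as the one above translates naturally into a guard in code. The following sketch is one possible reading: the class, its method names, and the decision that each move needs fresh user input are assumptions, not part of the stated requirement.

```python
class MotionController:
    """Refuses to move hardware until the user has provided input."""

    def __init__(self):
        self.position = 0
        self._user_confirmed = False

    def confirm(self):
        """Record that the user has explicitly requested motion."""
        self._user_confirmed = True

    def move_to(self, target: int):
        """Move only after user input; otherwise refuse and leave the
        hardware where it is."""
        if not self._user_confirmed:
            raise PermissionError("motion requested before user input")
        self.position = target
        self._user_confirmed = False   # assumption: each move needs fresh input
```

Writing the negation as an explicit check also gives a test a concrete behaviour to verify: a move attempted without confirmation must fail and leave the position unchanged.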
Create Safe Designs
Work from safe requirements
Adopt a safe architecture
Revisit, revise the hazard analysis during development
Select measures that provide appropriate levels of detection and correction
Ensure independent channels lack common-mode failures
Adopt consistent strategies for handling faults
Include POST and periodic run-time tests
Implementing Safety
Language Choice
Strong compile-time checking
Strong run-time checking
Support for encapsulation and abstraction (but not “just because”)
Exception handling
“Safe” language constructs
void*?
Assure the Safety Process
Continuously track against hazard analysis
Utilize peer reviews to assure quality
Verify design adherence
Verify coding standards
Identify how each hazard is handled
Test, test, test
Black box testing
White box testing
Monkey testing
Fault seeding
Load testing
Simulations
System testing
Unit testing