Reliability and Safety Week 7 What can go wrong?

Page 1: Reliability and Safety Week 7 What can go wrong?

Reliability and Safety

Week 7

What can go wrong?

Page 2: Reliability and Safety Week 7 What can go wrong?

Issues:

Hardware Errors
Software Errors
Fault vs Error

Page 3: Reliability and Safety Week 7 What can go wrong?

Computer failure causes:

Faulty design
Sloppy implementation
Careless or insufficiently trained users
Poor user interfaces
Hardware/Software malfunctions
Specification errors
Scope/Application inconsistency

Page 4: Reliability and Safety Week 7 What can go wrong?

Computer users' perspective

Should understand the limitations of computers
Need for proper training
Need for responsible use
Difference between good products and bad ones

Page 5: Reliability and Safety Week 7 What can go wrong?

Computer Professional Perspective

Study computer failures
Study computer ethics

Page 6: Reliability and Safety Week 7 What can go wrong?

Educated Member of Society Perspective

Help us evaluate the reliability and safety of various computer applications

Help evaluate computer technology

Page 7: Reliability and Safety Week 7 What can go wrong?

Three Categories of Failures

Problems for individuals
System failures that affect large numbers of people or cost large amounts of money
Problems in safety-critical applications

Page 8: Reliability and Safety Week 7 What can go wrong?

Problems for Individuals

Billing Errors
  Design and/or implementation of programs
  Not enough care - input error
  Not enough testing - amounts not checked against a reasonable range
  Not enough training
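
The "reasonable range" point can be illustrated with a short sketch. This is only an assumed example, not taken from any real billing system: the bounds `MIN_AMOUNT` and `MAX_AMOUNT` and both function names are hypothetical. The idea is that a bill outside a plausible range is held for human review instead of being mailed automatically.

```python
from decimal import Decimal

# Hypothetical bounds for a monthly bill; real limits would have to come
# from the billing system's own specification.
MIN_AMOUNT = Decimal("0.00")
MAX_AMOUNT = Decimal("5000.00")


def check_bill_amount(amount: Decimal) -> Decimal:
    """Return the amount unchanged if it is plausible, otherwise raise ValueError."""
    if not (MIN_AMOUNT <= amount <= MAX_AMOUNT):
        raise ValueError(f"bill amount {amount} is outside the reasonable range")
    return amount


def safe_to_send(amount: Decimal) -> bool:
    """True if the bill may go out automatically, False if a person should review it."""
    try:
        check_bill_amount(amount)
    except ValueError:
        return False
    return True
```

A check like this does not say what the correct amount is; it only catches values that no customer could plausibly owe, such as a multi-million-dollar household bill caused by an input error.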

Page 9: Reliability and Safety Week 7 What can go wrong?

Database Accuracy Problems

Info in database is not accurate

Automatic entering of info - mistakes can be overlooked

Copies of incorrect info can be in other systems

Not knowledgeable enough about the system

Page 10: Reliability and Safety Week 7 What can go wrong?

Causes

Large population
Most of our financial interactions are with strangers
Automated processing without human common sense
Overconfidence in accuracy of data
Lack of accountability

Page 11: Reliability and Safety Week 7 What can go wrong?

Consumer Hardware and Software

Usually have more serious errors in their first releases
Regularly sold with known bugs
Hardware also has flaws
Tradeoff between cost, debugging, and marketing
Dishonesty, denials of problems, lack of adequate response to complaints

Page 12: Reliability and Safety Week 7 What can go wrong?

System Failures

Lots of $$$$
Complete shutdown of basic services
Areas:
  Communications
  Business and financial systems
  Military

Page 13: Reliability and Safety Week 7 What can go wrong?

WHY?

Not enough testing
Technical difficulties
Poor management decisions
Dishonesty in promoting the system and responding to problems

Page 14: Reliability and Safety Week 7 What can go wrong?

Communications

Phone Service
How Bad?
  pagers
  phone calls
  911
  Communications for airports
  cellular phones

Page 15: Reliability and Safety Week 7 What can go wrong?

Business and financial systems

Stock exchange
ATM
Contest by Pepsi
  too many winning tickets issued

Page 16: Reliability and Safety Week 7 What can go wrong?

Destroying Business

Loss of sales
Incorrect info affects business
Dissatisfied customers
Incorrect prices
Loss of data

Page 17: Reliability and Safety Week 7 What can go wrong?

Military

Data management
Weapons system design
Battle simulation
Battle management
  command/control
  communications
  intelligence
Nuclear war

Page 18: Reliability and Safety Week 7 What can go wrong?

Why?

Not enough testing
Technical difficulties
Poor management decisions
Dishonesty in promoting the system and responding to problems

Results in delays and abandonment of projects

Page 19: Reliability and Safety Week 7 What can go wrong?

The Denver Airport baggage system

Outbound luggage checked at ticket counters or curbside to be delivered to anywhere in <10 minutes via automated system of cars on tracks
Connecting flights or terminals
Laser scanners
Tracks - 4000 cars

Page 20: Reliability and Safety Week 7 What can go wrong?

Problems Encountered

Cars crash into each other at intersections

Luggage misrouted, dumped or flung

Needed cars were idle or put to rest

Page 21: Reliability and Safety Week 7 What can go wrong?

Specific problems

Real world problems
  scanners got dirty
  knocked out of alignment
Software error
  rerouting of cars to waiting area - idle

Page 22: Reliability and Safety Week 7 What can go wrong?

Causes

Time allowed for development and testing was insufficient
Significant changes in specifications were made after the project began
Not enough debug time
Poor management
Unrealistic plan

Page 23: Reliability and Safety Week 7 What can go wrong?

Safety Critical Applications

Use of computers is increasing rapidly in these areas
Use of computers in these areas can save $
Areas
  Military
  Medical Applications
  Power plants
  Aircraft
  Trains

Page 24: Reliability and Safety Week 7 What can go wrong?

Aircraft - Fly by Wire

Pilots do not directly control plane
Actions are input to computers that control the aircraft systems
Pilot interaction is critical
Need for easy way to override computers
Easy transfer between automatic and manual control

Page 25: Reliability and Safety Week 7 What can go wrong?

Air Traffic Control

Long delays
Increased risk of collisions
Old machines - computer systems
Political - government spends $ elsewhere

Page 26: Reliability and Safety Week 7 What can go wrong?

Case Study - Therac-25

Software controlled radiation therapy machine used to treat people with cancer
Problems:
  Massive overdoses administered
  Repeated overdoses due to faulty display
  Death
Operated in dual machine mode - electron beam or x-ray photon beam

Page 27: Reliability and Safety Week 7 What can go wrong?

Why?

Lapses in good safety design
Insufficient testing
Bugs in software that controlled machines
Inadequate system of reporting and investigating accidents and deaths

Page 28: Reliability and Safety Week 7 What can go wrong?

Specific problems

Some hardware safety features were eliminated in newer models
Software used was assumed correct from older systems
Malfunctioned frequently
Weakness in design of operator interface
  inadequate explanation of error messages, if any

Page 29: Reliability and Safety Week 7 What can go wrong?

Specific problems continued

Machine allowed one-key intervention versus automatic shutdown
Inadequate documentation
Poor test plan

Page 30: Reliability and Safety Week 7 What can go wrong?

Software Errors - bugs

Fatal error was a simple fix
Fixes are complex, expensive, and prevent use of machine while fixing
Bugs can be intermittent and hard to detect
Importance of self checking
Importance of using good programming techniques
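
One way to picture self checking is a program that compares what it asked for with what the equipment reports before it acts. The sketch below is a generic illustration under that assumption; the `Settings` type and `verify_before_use` function are hypothetical and are not taken from the Therac-25 software or any real device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical machine configuration: an operating mode and an intensity level."""
    mode: str
    intensity: int


def verify_before_use(requested: Settings, read_back: Settings) -> None:
    """Refuse to proceed unless the settings read back match what was requested."""
    if requested != read_back:
        raise RuntimeError(
            f"self-check failed: requested {requested}, device reports {read_back}"
        )
```

Because the check runs every time, it can also catch an intermittent fault on the one occasion it occurs, which occasional manual inspection is likely to miss.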

Page 31: Reliability and Safety Week 7 What can go wrong?

Overconfidence

Leaving out changes that are necessary
Ignoring error messages
Not using backup devices (video or audio)

Page 32: Reliability and Safety Week 7 What can go wrong?

Conclusion and Perspective

Irresponsibility leads to criminal charges
Responsibility leads to merit awards
Importance of good software development
Consequences of carelessness, cutting corners, unprofessional work, or attempts to avoid responsibility
Lack of appreciation for risks
Poor training

Page 33: Reliability and Safety Week 7 What can go wrong?

Ways to prevent problems

Good computer systems
Good training
Accountability
Individual responsibility
Management responsibility
  i.e., IEEE Code of Ethics

Page 34: Reliability and Safety Week 7 What can go wrong?

Increasing Reliability and Safety

What goes wrong?
Many lines of code and many programmers
Problems are managerial, technical, social, legal, ethical

Page 35: Reliability and Safety Week 7 What can go wrong?

Overconfidence

Unappreciative of risks
Ignore warnings
Don’t consult manuals

Page 36: Reliability and Safety Week 7 What can go wrong?

Professional Techniques

Use good software engineering techniques at all stages of development:
  Requirements
  Specs
  Design
  Implementation
  Documentation
  Testing

Page 37: Reliability and Safety Week 7 What can go wrong?

Professional Techniques

Study the techniques and tools available

Knowing or learning enough about the application field and the software or systems being used

Page 38: Reliability and Safety Week 7 What can go wrong?

Why Study Failures?

Provides technical lessons
  Leads to improved hardware and software products
Provides ethical data
  Leads to improved ethical codes/laws

Page 39: Reliability and Safety Week 7 What can go wrong?

Lessons Learned

Accidents are not the result of unknown scientific principles but rather a failure to apply well-known engineering practices

Accidents will not be prevented by technological fixes alone; prevention requires control of all aspects of the development and operation of the system

Page 40: Reliability and Safety Week 7 What can go wrong?

Lessons Learned

Software developers need to recognize the limitations of software, and use hardware safety mechanisms

Page 41: Reliability and Safety Week 7 What can go wrong?

Redundancy and Self-checking

Redundancy - judging - expensive
Complex systems collect information to diagnose and correct errors
Audit trails are vital
Detailed records help protect against theft and help trace and correct errors
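
If "judging" here means comparing the outputs of redundant units and accepting the answer most of them agree on, the idea can be sketched as a majority vote. That reading is an assumption, and the `majority_vote` function below is a hypothetical illustration, not a description of any particular system.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value reported by a strict majority of redundant units.

    Raises RuntimeError when no value has a strict majority, so the caller
    can fall back to a safe state instead of trusting any single unit.
    """
    if not outputs:
        raise ValueError("no outputs to compare")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError(f"no majority among outputs: {list(outputs)}")
    return value
```

With three units, one faulty result is outvoted by the other two. The expense noted on the slide follows directly: every unit has to be built and run, although only one answer is used.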

Page 42: Reliability and Safety Week 7 What can go wrong?

Redundancy and Self-checking

Designed to constantly monitor itself and correct problems automatically
Half of the computing power is devoted to checking the rest for errors
  closes off part of the system
  reroutes
  corrects problems and reroutes again

Page 43: Reliability and Safety Week 7 What can go wrong?

TESTING

CRITICAL!
Principles and techniques exist
Can use another company to perform independent verification and validation

Page 44: Reliability and Safety Week 7 What can go wrong?

Dangerous Tendencies

Operators
  Bypass check mechanisms through familiarity
Technicians
  Blame random mechanical or signal glitches rather than software
Corporate Managers
  Initially deny and ignore - then cover up
  Finally - deal with expensive fixes

Page 45: Reliability and Safety Week 7 What can go wrong?

Overall Lessons Learned

Should not declare problem understood with first hypothesis

Should not expect management to follow through on field reports

Overconfidence in software leads to economically marginal designs

Page 46: Reliability and Safety Week 7 What can go wrong?

Overall Lessons Learned

Enforcement of software engineering practices is often abysmal

Basing risk assessments on individual subsystems often leads to unrealistic optimism

Page 47: Reliability and Safety Week 7 What can go wrong?

Lessons for systems engineering

Hardware backups valuable
Software must not be presumed innocent
Related software errors can be indistinguishable
Audit trails are critical
Risk estimates are subjective
User feedback is valuable
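
An audit trail can be as simple as an append-only record of who did what and when. The sketch below is a minimal in-memory illustration; the `AuditTrail` class and its field names are assumptions made for this example, and a real system would also need durable, tamper-resistant storage.

```python
import json
from datetime import datetime, timezone


class AuditTrail:
    """Append-only list of records describing who did what and when."""

    def __init__(self):
        self._records = []

    def record(self, actor: str, action: str, **details) -> None:
        """Append one entry; existing entries are never edited or removed."""
        self._records.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "details": details,
        })

    def by_actor(self, actor: str) -> list:
        """Return every entry made by the given actor, oldest first."""
        return [r for r in self._records if r["actor"] == actor]

    def dump(self) -> str:
        """Serialize the trail as one JSON object per line for later review."""
        return "\n".join(json.dumps(r, sort_keys=True) for r in self._records)
```

For example, calling `trail.record("operator-1", "override", reason="error message shown")` leaves an entry that investigators can later retrieve with `trail.by_actor("operator-1")` when tracing how an error occurred.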

Page 48: Reliability and Safety Week 7 What can go wrong?

Lessons for software engineering

Documentation should be on-going
Designs should be kept simple
Testing should be built into software
Software must be tested out of system and in system
Reused software should be tested like new software

Page 49: Reliability and Safety Week 7 What can go wrong?

Lessons for oversight

Users are more likely to make initial observations than monitoring officials

Users need reliable information in order to be maximally valuable

Page 50: Reliability and Safety Week 7 What can go wrong?

Laws and Regulations

Criminal and Civil penalties
  Suits against company that designs or sells the system
  Criminal charges when fraud or criminal negligence occurs
Need contracts
Need well designed laws and standards

Page 51: Reliability and Safety Week 7 What can go wrong?

Regulation

Requirement for approval by a government agency before a new product can be sold, including specific testing requirements

The profit motive causes skimping on safety

Better to abandon in some cases
Inadequate abilities to judge by customer
Hard to sue large companies

Page 52: Reliability and Safety Week 7 What can go wrong?

Regulation

Expensive and time-consuming

Newer procedures may not be enforced

Lots of paperwork

Page 53: Reliability and Safety Week 7 What can go wrong?

Professional licensing

Licensing of software development professionals to protect against poor quality and unethical behavior
  Specific training
  Passing competency exam
  Ethical requirements
  Continuing education