CSCI 5230: Project Management
Software Reuse Disasters: Therac-25 and Ariane 5 Flight 501
David Sumpter
12/4/2001
12/04/01 2
Contents
Introduction
Therac-25 – Background
Therac-25 – Software Process
Therac-25 – Causes
Flight 501 – Sequence of Events
Ariane 5 Software Process
Flight 501 – Cause of Failure
Conclusion
References
Introduction
Two famous software engineering disasters
Therac-25
Medical accelerator used to treat tumors
Six known accidents resulting in death or serious injury, June 1985 – January 1987
Software adapted from earlier models
Ariane 5 Flight 501
Maiden flight of the Ariane 5 launch vehicle, June 1996
Larger, more powerful successor to the Ariane 4
Exploded approximately 40 seconds after launch
Loss traced to software carried over virtually unchanged from Ariane 4
Therac-25 – Background
25 MeV medical accelerator designed to destroy tumors
Dual mode: electron beam or X-rays
Successor to the Therac-6 and Therac-20
Therac-6 and Therac-20
Computer control added to earlier machines
Still capable of stand-alone (no computer) operation
Retained all standard hardware safety mechanisms
Therac-25 more dependent on software
Lacked many hardware safety mechanisms of the earlier accelerators
Therac-25 software “evolved from” Therac-6 code
PDP-11 assembly, no standard OS
Also contained Therac-20 code
Therac-25 – Software Process
Little, if any, process
Single programmer
Minimal unit and software testing; emphasis on integrated system testing
1983 safety analysis, in effect, assumed that the software had no errors!
“Programming errors have been reduced by extensive testing ... Any residual software errors are not included in the analysis.”
“Computer execution errors are caused by faulty hardware components and by ‘soft’ (random) errors induced by alpha particles and electromagnetic noise.”
Therac-25 – Causes
For three of the six known incidents, the cause is unknown:
“there is no way to determine what particular design errors were related to the... accidents. Given the unsafe programming practices in the code, it is possible that unknown race conditions or errors could have been responsible”
Two fatal accidents in Tyler, Texas
A race condition led to inconsistent machine settings, causing massive radiation overdoses
The same bug was later found in the Therac-20
There, hardware safeguards kept it from causing injuries, and even from being discovered, until after the Tyler accidents
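The general shape of the Tyler failure can be illustrated with a toy interleaving. This is a minimal sketch under stated assumptions. The record, routine names, and setting labels are hypothetical. The two "tasks" are interleaved by hand rather than run as real threads. Nothing here reproduces the actual Therac-25 logic, which was PDP-11 assembly. The sketch is loosely modeled on an operator quickly correcting the treatment mode from X-ray to electron. Two routines read the shared mode at different times. An edit that lands between the two reads leaves the machine inconsistent. An independent interlock, the role hardware played on the Therac-20, is what stops the beam.

```python
from dataclasses import dataclass

# Hypothetical, heavily simplified model: NOT the Therac-25 code.
# Two routines read the same shared prescription at different times;
# the interleaving is written out by hand so the race is deterministic.


@dataclass
class Prescription:
    mode: str  # "xray" or "electron", as entered by the operator


@dataclass
class Machine:
    beam_level: str = ""  # beam strength chosen for the mode
    turntable: str = ""   # position of the beam-shaping hardware


def set_beam_level(machine: Machine, rx: Prescription) -> None:
    # Routine 1 reads the mode first.
    machine.beam_level = "xray-level" if rx.mode == "xray" else "electron-level"


def set_turntable(machine: Machine, rx: Prescription) -> None:
    # Routine 2 reads the mode later.
    machine.turntable = "xray-target" if rx.mode == "xray" else "electron-scan"


def run_treatment(edit_at: str, hardware_interlock: bool) -> str:
    """edit_at: when the operator's X-ray -> electron correction lands:
    "never", "before", "between" (the two reads), or "after"."""
    rx = Prescription(mode="xray")
    machine = Machine()
    if edit_at == "before":
        rx.mode = "electron"
    set_beam_level(machine, rx)
    if edit_at == "between":
        rx.mode = "electron"
    set_turntable(machine, rx)
    if edit_at == "after":
        rx.mode = "electron"  # too late: neither routine re-reads the mode

    # The software never cross-checks the two settings; this comparison models
    # what an independent hardware interlock would sense.
    consistent = (machine.beam_level == "xray-level") == (machine.turntable == "xray-target")
    if consistent:
        return "treat"
    return "blocked by interlock" if hardware_interlock else "OVERDOSE"


for timing in ("never", "before", "between", "after"):
    print(timing, run_treatment(timing, hardware_interlock=False))
```

Only the `"between"` timing produces an unsafe state. That rarity is why such bugs are hard to reproduce and easy to miss in testing. Calling `run_treatment("between", hardware_interlock=True)` shows how the same defect can stay hidden behind a hardware safeguard.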
Fatal accident at Yakima, Washington
Overflow of a 1-byte variable led, under rare conditions, to improper machine settings and a massive radiation overdose
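In Leveson and Turner's account, the 1-byte variable was a flag that the software incremented rather than set to a fixed nonzero value. As a result, it periodically wrapped around to zero, and a safety check was bypassed. The sketch below models only that mechanism. The function names are hypothetical, and the loop stands in for repeated passes through a setup routine.

```python
from typing import Callable, List

# Hypothetical names; a simplified model, not the actual Therac-25 code.
BYTE_MASK = 0xFF  # emulate a one-byte (8-bit unsigned) variable


def increment_flag(flag: int) -> int:
    """Buggy update: mark 'verify settings' by incrementing (wraps 255 -> 0)."""
    return (flag + 1) & BYTE_MASK


def set_flag(flag: int) -> int:
    """Safer update: set the flag to a fixed nonzero value every time."""
    return 1


def skipped_passes(update: Callable[[int], int], passes: int) -> List[int]:
    """Return the pass numbers on which the safety check would be skipped."""
    flag = 0
    skipped = []
    for n in range(1, passes + 1):
        flag = update(flag)
        if flag == 0:  # zero is read as "no check needed"
            skipped.append(n)
    return skipped


print(skipped_passes(increment_flag, 600))  # [256, 512]
print(skipped_passes(set_flag, 600))        # []
```

With the incrementing version, every 256th pass silently skips the check. That is the kind of "rare condition" a short test session is unlikely to hit.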
Flight 501 – Sequence of Events
Near-simultaneous failure of the primary and back-up Inertial Reference Systems (SRIs) 36 seconds after main engine ignition
The failed SRI's diagnostic output is interpreted by the on-board computer as flight data
Nozzles of the two solid boosters and the main engine swivel to extreme positions
Nozzles direct rocket thrust and steer the launcher
Launcher veers abruptly
Links between the solid boosters and the core stage rupture, triggering self-destruct
Ariane 5 Software Process
Stringent processes in place, but…
“the culture within the Ariane program…” only addressed “random hardware failures… which can quite rationally be handled by a backup system”
“the view had been taken that software should be considered correct until it is shown to be at fault”! (emphasis added)
SRIs were not included in integrated tests, but simulated by special software
Including them was considered technically difficult and expensive
SRIs considered “fully qualified at equipment level”
“The design of the Ariane 5 SRI is practically the same as that of an SRI which is presently used on Ariane 4, particularly as regards the software”
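The gap in the "backup system" reasoning can be shown with a toy model. A backup unit protects against independent, random hardware faults. However, both SRIs on Flight 501 ran identical software, so an input that triggers a software defect fails both. This is a sketch of that common-mode argument, not a model of the SRIs. The function names, the threshold, and the input values are hypothetical, and the failover is reduced to a simple fallback.

```python
from typing import Optional

# Hypothetical model of a primary/backup pair: separate hardware, shared software.


def reused_software(x: float) -> float:
    # Hypothetical defect: the code was never meant to see values this large.
    if abs(x) > 32767:
        raise OverflowError("input outside the range the code was written for")
    return x


def channel(x: float, hardware_ok: bool) -> Optional[float]:
    """One unit. Returns None when the unit fails."""
    if not hardware_ok:
        return None  # random hardware fault: independent per unit
    try:
        return reused_software(x)
    except OverflowError:
        return None  # software design fault: identical in every copy


def redundant_read(x: float, primary_ok: bool, backup_ok: bool) -> Optional[float]:
    """Use the primary unit; fail over to the backup only if the primary fails."""
    result = channel(x, primary_ok)
    return result if result is not None else channel(x, backup_ok)


# A hardware fault in one unit is masked by the backup...
print(redundant_read(1000.0, primary_ok=False, backup_ok=True))  # 1000.0
# ...but an input that triggers the shared software defect defeats both.
print(redundant_read(40000.0, primary_ok=True, backup_ok=True))  # None
```

Duplicating identical software multiplies the hardware, not the independence. Against a design fault, both copies behave the same way.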
Flight 501 – Cause of Failure
Software exception “during… data conversion from 64-bit floating point to 16-bit signed integer”
Occurred in the SRI software
Overflow caused by an unexpectedly high value of the Horizontal Bias (BH) variable
BH is related to horizontal velocity
Conversion not protected, to save computer processing power
Analysis had determined that overflow could not occur
Reasoning not documented in the code
Ariane 5 has a higher horizontal velocity early in its trajectory than Ariane 4!
The part of the software where the error occurred was not needed after launch
The requirement to keep it operating after launch traces to earlier versions of Ariane
Enabled a prompt restart of the countdown in the event of a hold
Did not apply to Ariane 5, but was maintained for commonality
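The difference between an unprotected and a protected 64-bit-float to 16-bit-signed-integer conversion can be sketched as follows. The original SRI code was Ada and is not reproduced here. The inquiry report calls the resulting exception an Operand Error, and `OperandError` below is only a stand-in for it. The function names and the sample readings are hypothetical, not flight data.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


class OperandError(Exception):
    """Stand-in for the exception raised by an unprotected conversion."""


def to_int16_unprotected(x: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit."""
    if math.isnan(x) or math.isinf(x):
        raise OperandError(f"{x!r} has no integer value")
    n = round(x)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OperandError(f"{x!r} does not fit in a 16-bit signed integer")
    return n


def to_int16_saturating(x: float) -> int:
    """Protected variant: clamp out-of-range values to the nearest limit."""
    if math.isnan(x):
        raise ValueError("NaN cannot be converted")
    if x >= INT16_MAX:
        return INT16_MAX
    if x <= INT16_MIN:
        return INT16_MIN
    return round(x)


# Hypothetical BH-like readings: the first fits in 16 bits, the second does not.
for bh in (1234.0, 50000.0):
    print("saturating:", bh, "->", to_int16_saturating(bh))
    try:
        print("unprotected:", bh, "->", to_int16_unprotected(bh))
    except OperandError as err:
        print("unprotected conversion raised:", err)
```

Clamping is not automatically safe. It trades a crash for a bounded but wrong value, so the rest of the system still has to decide what to do with it. Each protected conversion also costs processing time, which is the trade-off behind leaving BH unprotected.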
Conclusion
In both cases, software was carried over from earlier projects where it had seemingly worked well
Therac-25
Software defects in earlier machines were hidden by hardware safeguards
No real software development process
Apparently no serious evaluation of the risks involved in using software in lieu of hardware safeguards
Ariane 5
Known “defect” was a non-issue on Ariane 4
Established software development process in place
Issues were considered, but a key factor was missed
Conclusion, cont.
Misunderstanding of software?
Both were primarily hardware projects
Existing software was reused in the development of new hardware
Not only underestimated the complexity of the software, but failed to recognize that it was even an issue
Both projects made the absolutely astounding assumption that the software didn’t have errors!
Treated software as a “black box” that could be swapped in and out of different applications
No evidence that reuse was considered in the design of the software
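One way to act on the last two points is to make a reused component carry the input range it was analyzed for. An integrator then has to compare that range against the new system instead of treating the component as a black box. This is a sketch of that idea, not a technique either project used. Every name and number in it is hypothetical, and the ranges are not Ariane data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Closed range of input values a component was analyzed and tested for."""
    low: float
    high: float

    def contains(self, other: "Envelope") -> bool:
        return self.low <= other.low and other.high <= self.high


@dataclass(frozen=True)
class ReusableComponent:
    name: str
    qualified_for: Envelope  # the documented assumption travels with the code


def check_reuse(component: ReusableComponent, expected: Envelope) -> None:
    """Reject integration if the new system may drive inputs outside the envelope."""
    q = component.qualified_for
    if not q.contains(expected):
        raise ValueError(
            f"{component.name}: qualified for [{q.low}, {q.high}], "
            f"new system expects [{expected.low}, {expected.high}]"
        )


component = ReusableComponent("hypothetical velocity-conversion module", Envelope(-32768.0, 32767.0))
old_vehicle = Envelope(0.0, 20000.0)  # made-up range for the original system
new_vehicle = Envelope(0.0, 60000.0)  # made-up range for the new system

check_reuse(component, old_vehicle)  # passes silently
try:
    check_reuse(component, new_vehicle)
except ValueError as err:
    print("reuse rejected:", err)
```

The check itself is trivial. The design point is that the envelope is recorded alongside the component, so the reasoning behind "overflow cannot occur" is visible at the moment the component is reused.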
References
Inquiry Board. Ariane 5 Flight 501 Failure. Inquiry Board report (July 1996). Available online at http://www.mssl.ucl.ac.uk/www_plasma/missions/cluster/ariane5rep.html
Leveson, N., Turner, C.S. An Investigation of the Therac-25 Accidents. IEEE Computer, vol. 26, no. 7 (July 1993), 18-41. Available online at http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html
Thank You