May 2002 11
Science is supposed to learnmore from its failures thanits successes. Schools ac-tively teach trial and error:If at first you dont succeed,
try, try again. On the other hand, weretold to look before we leap but that hewho hesitates is lostso maybe wereall doomed no matter what.
This notion that trying and failing isnormal, and natural, can lead youastray in your professional life, how-ever. In practice, its a lot harder to geta conference paper accepted that is ofthe form, We wanted to help solveimportant problem X, so we tried Y,but instead of the hoped-for result Z,we got yucky result Q. And when Isay a lot harder, I mean theres noway. This is true in science as well asengineering, and its responsible for theoverall impression you get when read-ing conference proceedings: Everythingother people try works perfectly.Youre the only one who goes downblind alleys and suffers repeated fail-ures.
Theres a good reason why confer-ences and journals are biased this way.Sometimes theres a very fine linebetween good ideas that didnt workout and just plain stupid ideas. Andsince there are a thousand wrong waysto do something for every way thatworks, a predisposition for papers withpositive results is a quick, generallyeffective filter that saves conferencecommittees a lot of time. (They needto save that time so they can use it later,debating how to break it to a fellowconference committee member thatthey want to reject his paper. As Dave
Barry would say, Im not making thisup.) Of course, prospective authorsknow this, and they would be in dan-ger of biasing their results toward themost positive light if it werent for theirdedication to the scientific ideal. (Areyou buying any of that? Me neither.)
Scientists get to run their experi-ments multiple times, and they cangradually reduce or eliminate sourcesof systematic errors that might other-wise bias their results. Engineers tendto get only one chance to do some-thing: one bridge to build, one CPU todesign, one software development pro-ject to crank out late, slow, and.Whoops. I got carried away there.
ANTICIPATIONBecause engineers generally cant test
their creations to the point of satura-tion, they must make do with a lot ofsubstitutions: anticipation of all possi-ble failure modes; a comprehensive set
of requirements; dedicated validationand verification teams; designing witha built-in safety margin; formal verifi-cation where possible; and testing, test-ing, testing. If you didnt test it, itdoesnt work.
In some cases, computers havebecome fast enough to permit testingevery combination of bit patterns. Ifyoure designing a single-precisionfloating-point multiplier, you know theunit is correct because you tested everypossible way it could be used. Now allyou have to do is worry about resets,protocol errors into and out of theunit, electrical issues, noise, thermals,crosstalk, and why your boss didntsmile at you when she said good morn-ing yesterday.
Testing to saturationBut many, perhaps most, things you
design cant be tested to saturation.Double-precision floating-point num-bers have so many bit patterns that itshard to see when, if ever, such unitscould be tested to saturation. Two tothe 64th power is an extremely largenumber, and there are usually twooperands involved. And thats for afunctional unit with a fairly regulardesign and state sequence. Saturationtesting of anything beyond trivial soft-ware is often even more out of thequestion.
So it behooves us to try to anticipatehow our designs will be used, certainlyunder nominal conditions, but alsounder non-nominal conditions, whichusually place the system under higherstress. Programmers have a range oftechniques at their disposal, such asdefensive programming (checkinginput values before blindly usingthem), testing assertions (automaticallychecking for conditions the program-mer thought were impossible), andalways checking function return val-ues. Engineers design in fail-safe or fail-soft mechanisms because they knowthat the unexpected happens all thetime. If youve ever driven a car thathad the check-engine-light illuminated,you were in one of those modes.
If You Didnt Test It, It Doesnt WorkBob Colwell
Most things you design cant be tested to saturation.
A t R a n d o m
Real-world usesOne of the more difficult tasks in
engineering is trying to imagine all ofthe conditions under which your cre-ation will be used in the real worldor, if youre an aerospace engineer, inthe real universe. There are user inter-faces to consider: Automobile manu-facturers long ago resigned themselvesto designing to the lowest commondenominator of the human populationin terms of intelligence applied to theoperation of a motor vehicle. But idiot-proofing is difficult precisely becauseidiots are so clever.
There are also legal concernsjuriesthat award damages to people whothink driving with a hot cup of coffeebetween their legs is a reasonableproposition may well fail to grasp theengineering tradeoffs inherent in anydesign. How do we protect idiots fromthemselves?
DESIGNING WITH MARGINAs a conscientious designer, you try
to anticipate all the ways your designmay be used in the future. Since you dooutstanding work, you expect that youor someone else will reuse your codeor your silicon in the future. So you tryto guess what that future designer willwant and strive to provide it.
Youve also been around this par-ticular block a few times before, soyou build in some flexibility becauseyou know that project managementwill change its mind about certainmajor project goals as you go along.And since debugging was such a bearlast time, you allow for lots of debughooks and performance counters. Ifyou keep traveling down this road,youre going to make yourself crazy.Yes, more than you are now. You needto compromise.
The art of compromiseEngineering is the art of compro-
mise: now versus later, concise versusgeneralized, schedule versus thor-oughness. And all around you, theother designers are falling behind andasking for helpyour help.
This is one area of engineering inwhich you cant simply overwhelm theproblem with force. You want an ele-gant, intelligent, balanced creation thatmeets all of the specs and also the intentof the product, stated or otherwisenot a sandbagged, bloated, turkey of adesign. Yes, it will take compromises,but the magic is in getting those com-promises right. Engineers know elegantdesigns when they see them. Aspire toproduce them.
Tragedy of the commonsAlways keep in mind the tragedy of
the commons. The idea is that thereare shared resources, initially allocatedin the hope of having enough for all,but overall management of thoseresources is essentially a distributedlocal function.
If youre a silicon chip designer, yourjob is to implement your piece of thedesign in the area you were allocated,and to do so within the power, sched-ule, and timing constraints. You knowthat those constraints are somewhatfungible: If you had more scheduletime, you could compact your real-estate footprint. With more thermalpower headroom, you could meet yourclocking constraint.
The tragedy of the commons occurswhen, faced with the same problemsyou face, your fellow designers decideto borrow more of the chip area thanthey were allocated. It appears to eachof them that theyre using only a tinyfraction of the unallocated chip space,while substantially boosting the diearea available to their particular func-tion. And so they are. But if everyonedoes that, the chips overall die sizewill exceed the target, with no margin
left to do anything about it. Thinklocally and globally.
VALIDATION AND VERIFICATIONDid I mention this: If you didnt test
it, it doesnt work. In DragonFly:NASA and the Crisis Aboard MIR(HarperCollins, New York, 1998),Bryan Burrough gives a rivetingaccount of several near catastropheson the Russian space station.
In one instance, an oxygen genera-tor catches fire, and a blowtorch-likeflame begins to bloom. Jerry Linenger,the sole American astronaut, grabs anoxygen mask, but it doesnt work. Hegrabs another one that does work. Thestation commander leads Linenger tothe station module where the fire extin-guishers are kept. Linenger grabs onebut is startled to find it is secured tothe wall. Both men pull at it, but thewall wins. They try another extin-guisher, with the same outcome.
Later analysis revealed that thetransportation straps for the fire extin-guishers were still installedin theintervening 19 months of service, noone had thought to wield the wrenchrequired to remove them. In perfecthindsight, this also suggests that what-ever fire drills had been performed inthose months werent realistic enoughor the astronauts would have foundthe problem earlier.
An important distinctionNASA draws a useful distinction
between validation and verification.Verification is the act of testing a designagainst its specifications. Validationtests the system against its operationalgoals.
An example of why the distinctionis important comes from Endeavoursmaiden flight, when it tried to ren-dezvous with the Intelsat satellite. Asit turned out, the state-space part ofthe shuttles programming had beendone with double-precision floating-point arithmetic, and the boundarychecking part was done with single-precision arithmetic. The shuttlesattempted rendezvous didnt converge
The tragedy of the commons occurs when
designers decide to borrow more of the chip area than they