May 2002 11
Science is supposed to learnmore from its failures thanits successes. Schools ac-tively teach trial and error:If at first you dont succeed,
try, try again. On the other hand, weretold to look before we leap but that hewho hesitates is lostso maybe wereall doomed no matter what.
This notion that trying and failing isnormal, and natural, can lead youastray in your professional life, how-ever. In practice, its a lot harder to geta conference paper accepted that is ofthe form, We wanted to help solveimportant problem X, so we tried Y,but instead of the hoped-for result Z,we got yucky result Q. And when Isay a lot harder, I mean theres noway. This is true in science as well asengineering, and its responsible for theoverall impression you get when read-ing conference proceedings: Everythingother people try works perfectly.Youre the only one who goes downblind alleys and suffers repeated fail-ures.
Theres a good reason why confer-ences and journals are biased this way.Sometimes theres a very fine linebetween good ideas that didnt workout and just plain stupid ideas. Andsince there are a thousand wrong waysto do something for every way thatworks, a predisposition for papers withpositive results is a quick, generallyeffective filter that saves conferencecommittees a lot of time. (They needto save that time so they can use it later,debating how to break it to a fellowconference committee member thatthey want to reject his paper. As Dave
Barry would say, Im not making thisup.) Of course, prospective authorsknow this, and they would be in dan-ger of biasing their results toward themost positive light if it werent for theirdedication to the scientific ideal. (Areyou buying any of that? Me neither.)
Scientists get to run their experi-ments multiple times, and they cangradually reduce or eliminate sourcesof systematic errors that might other-wise bias their results. Engineers tendto get only one chance to do some-thing: one bridge to build, one CPU todesign, one software development pro-ject to crank out late, slow, and.Whoops. I got carried away there.
ANTICIPATIONBecause engineers generally cant test
their creations to the point of satura-tion, they must make do with a lot ofsubstitutions: anticipation of all possi-ble failure modes; a comprehensive set
of requirements; dedicated validationand verification teams; designing witha built-in safety margin; formal verifi-cation where possible; and testing, test-ing, testing. If you didnt test it, itdoesnt work.
In some cases, computers havebecome fast enough to permit testingevery combination of bit patterns. Ifyoure designing a single-precisionfloating-point multiplier, you know theunit is correct because you tested everypossible way it could be used. Now allyou have to do is worry about resets,protocol errors into and out of theunit, electrical issues, noise, thermals,crosstalk, and why your boss didntsmile at you when she said good morn-ing yesterday.
Testing to saturationBut many, perhaps most, things you
design cant be tested to saturation.Double-precision floating-point num-bers have so many bit patterns that itshard to see when, if ever, such unitscould be tested to saturation. Two tothe 64th power is an extremely largenumber, and there are usually twooperands involved. And thats for afunctional unit with a fairly regulardesign and state sequence. Saturationtesting of anything beyond trivial soft-ware is often even more out of thequestion.
So it behooves us to try to anticipatehow our designs will be used, certainlyunder nominal conditions, but alsounder non-nominal conditions, whichusually place the system under higherstress. Programmers have a range oftechniques at their disposal, such asdefensive programming (checkinginput values before blindly usingthem), testing assertions (automaticallychecking for conditions the program-mer thought were impossible), andalways checking function return val-ues. Engineers design in fail-safe or fail-soft mechanisms because they knowthat the unexpected happens all thetime. If youve ever driven a car thathad the check-engine-light illuminated,you were in one of those modes.
If You Didnt Test It, It Doesnt WorkBob Colwell
Most things you design cant be tested to saturation.
A t R a n d o m
Real-world usesOne of the more difficult tasks in
engineering is trying to imagine all ofthe conditions under which your cre-ation will be used in the real worldor, if youre an aerospace engineer, inthe real universe. There are user inter-faces to consider: Automobile manu-facturers long ago resigned themselvesto designing to the lowest commondenominator of the human populationin terms of intelligence applied to theoperation of a motor vehicle. But idiot-proofing is difficult precisely becauseidiots are so clever.
There are also legal concernsjuriesthat award damages to people whothink driving with a hot cup of coffeebetween their legs is a reasonableproposition may well fail to grasp theengineering tradeoffs inherent in anydesign. How do we protect idiots fromthemselves?
DESIGNING WITH MARGINAs a conscientious designer, you try
to anticipate all the ways your designmay be used in the future. Since you dooutstanding work, you expect that youor someone else will reuse your codeor your silicon in the future. So you tryto guess what that future designer willwant and strive to provide it.
Youve also been around this par-ticular block a few times before, soyou build in some flexibility becauseyou know that project managementwill change its mind about certainmajor project goals as you go along.And since debugging was such a bearlast time, you allow for lots of debughooks and performance counters. Ifyou keep traveling down this road,youre going to make yourself crazy.Yes, more than you are now. You needto compromise.
The art of compromiseEngineering is the art of compro-
mise: now versus later, concise versusgeneralized, schedule versus thor-oughness. And all around you, theother designers are falling behind andasking for helpyour help.
This is one area of engineering inwhich you cant simply overwhelm theproblem with force. You want an ele-gant, intelligent, balanced creation thatmeets all of the specs and also the intentof the product, stated or otherwisenot a sandbagged, bloated, turkey of adesign. Yes, it will take compromises,but the magic is in getting those com-promises right. Engineers know elegantdesigns when they see them. Aspire toproduce them.
Tragedy of the commonsAlways keep in mind the tragedy of
the commons. The idea is that thereare shared resources, initially allocatedin the hope of having enough for all,but overall management of thoseresources is essentially a distributedlocal function.
If youre a silicon chip designer, yourjob is to implement your piece of thedesign in the area you were allocated,and to do so within the power, sched-ule, and timing constraints. You knowthat those constraints are somewhatfungible: If you had more scheduletime, you could compact your real-estate footprint. With more thermalpower headroom, you could meet yourclocking constraint.
The tragedy of the commons occurswhen, faced with the same problemsyou face, your fellow designers decideto borrow more of the chip area thanthey were allocated. It appears to eachof them that theyre using only a tinyfraction of the unallocated chip space,while substantially boosting the diearea available to their particular func-tion. And so they are. But if everyonedoes that, the chips overall die sizewill exceed the target, with no margin
left to do anything about it. Thinklocally and globally.
VALIDATION AND VERIFICATIONDid I mention this: If you didnt test
it, it doesnt work. In DragonFly:NASA and the Crisis Aboard MIR(HarperCollins, New York, 1998),Bryan Burrough gives a rivetingaccount of several near catastropheson the Russian space station.
In one instance, an oxygen genera-tor catches fire, and a blowtorch-likeflame begins to bloom. Jerry Linenger,the sole American astronaut, grabs anoxygen mask, but it doesnt work. Hegrabs another one that does work. Thestation commander leads Linenger tothe station module where the fire extin-guishers are kept. Linenger grabs onebut is startled to find it is secured tothe wall. Both men pull at it, but thewall wins. They try another extin-guisher, with the same outcome.
Later analysis revealed that thetransportation straps for the fire extin-guishers were still installedin theintervening 19 months of service, noone had thought to wield the wrenchrequired to remove them. In perfecthindsight, this also suggests that what-ever fire drills had been performed inthose months werent realistic enoughor the astronauts would have foundthe problem earlier.
An important distinctionNASA draws a useful distinction
between validation and verification.Verification is the act of testing a designagainst its specifications. Validationtests the system against its operationalgoals.
An example of why the distinctionis important comes from Endeavoursmaiden flight, when it tried to ren-dezvous with the Intelsat satellite. Asit turned out, the state-space part ofthe shuttles programming had beendone with double-precision floating-point arithmetic, and the boundarychecking part was done with single-precision arithmetic. The shuttlesattempted rendezvous didnt converge
The tragedy of the commons occurs when
designers decide to borrow more of the chip area than they
May 2002 13
levels of intelligent thought, and he usesnumerical sequences as the clearestanalogy-generators. You remembertheseWhat comes next in this se-quence: 5, 10, 15? Yep, 20 would bemy bet too.
So what comes next in the followingsequence: 0, 1, 2? It has to be 3, doesnt it? But wait, why would I giveyou a trivial sequence and claim itsgoing to stimulate your brain? Whatelse could it be, if not 3?
Even when I give you the answer,you probably still wont know wherethis is coming from. The answer is 0,1, 2, 720!. Thats 720 factorial: 0, 1, 2,720! = 0, 1, 2, 6!! = 0, 1!, 2!!, 3!!!.
The important thing about thisexample is not the actual mathinvolved per se. Its that when you firstsaw the sequence 0, 1, 2, your imme-diate instinct probably was to answer3 and move on. But some other partof your brain, your Validation Reflex,became instantly suspicious and said,Not so fast. Somethings not quiteright here.
Thats the voice you must learn tohear to do truly outstanding valida-tion. Thats the instinct that allows youto take a step back from the implicitassumptions that others around youare making so you can go in a differ-ent direction where you may see some-thing everyone else is missing.
Y eah, I knowI cheated a littlewith that math sequence. Itdoesnt seem quite fair to throwin factorials when the originalsequence didnt have them. You couldcomplain to Hofstadter. But youshould read his book first.
In the meantime, whats next in thissequence: If you didnt test it, ?
Bob Colwell was Intels chief IA32architect through the Pentium II, III,and 4 microprocessors. He is now anindependent consultant. Contact himat firstname.lastname@example.org.
would happen if he tried to catch afield goal. For readers who dont fol-low American football, a field goal isan attempt by the offensive team tokick the ball between the goal posts atthe end of the field. During a field goalattempt, the offense just tries to keepthe defense from blocking the kick.There would be no point in sendingany offensive players downfield, sonobody ever does.
The validator was unswayed by thespec that theres no point in trying tocatch the ball after it has been kicked;in true validation hero fashion, he sim-ply noticed that it didnt seem to beimpossible. After trying for a couple ofhours, he succeeded in doing it. Sincethe games designers had never con-ceived of anyone trying to do such agoofy thing, the games specs didntinclude a requirement for how to han-dle it. As the validator suspected, thegame didnt handle the situation verywellit locked up.
Stepping outside the box Having a validation mindset that
allows an independent thinker to stepoutside a design projects orthodoxy isabsolutely vital. Such people often arethe last line of defense between some-thing thats overlooked in design anda user who isnt happy with the prod-uctor worse.
Although its not a common experi-ence in most peoples daily lives, youknow what this mindset feels like. Anexample borrowed from DouglasHofstadters Fluid Concepts and Crea-tive Analogies (Basic Books, New York,1995) simulates it. Hofstadter believesthat analogies are among the highest
to a solution due to this mismatch;only a live patch from the groundsaved the mission.
If the specs called for double preci-sion, verification of the state-spacecode would never find this potentialproblem: different specs for differentcode. Validation, on the other hand,could reasonably have been expectedto detect this mismatch before theastronauts encountered it hundreds ofmiles above the earth.
Im very fond of validation. Verifi-cation is absolutely necessary, and itsas important (and difficult) a task asdesign or architecture. In my experi-ence, however, the job of making surea design correctly implements its specsisnt likely to be forgotten or over-looked. Verification may not always bedone very well, but it probably wontbe skipped entirely. Validation, on theother hand, is often misunderstood,and management sometimes doesntremember why its so necessary.
Seeing the big pictureIn my experience, after months or
years of intensive design and develop-ment, the designers are tired, theyrebeing pressured to hit their productionschedule, and they just want to finishthe project. Some part of them reallydoesnt want to hear that there mightbe anything wrong with their baby.The architects often have moved intothe initial phase of another design, andthey may not even be available, muchless actively engaged in final testing. Sothe validation folks are the only onesleft who can try to see the Big Picture.
Especially on long design projects,fundamental changes may haveoccurred in the overall product land-scape that havent been fully incorpo-rated into the specs: Competition hasarisen, a lawsuit may have beenresolved in an unfavorable direction,last-minute design changes were madethat wont have benefited from the yearsof testing the design otherwise enjoyed.
Heres an example of what I mean.While testing the Madden NFL 99 PCgame, a validator wondered what
Having a validationmindset that allows anindependent thinker tostep outside a designprojects orthodoxy is
CCC: 0-7803-5957-7/00/$10.00 2000 IEEE
ccc: 0-7803-5957-7/00/$10.00 2000 IEEE
cce: 0-7803-5957-7/00/$10.00 2000 IEEE
Intentional blank: This page is intentionally blank