



Journal of Economic Literature, Vol. XXXVII (June 1999), pp. 633–665


The Numerical Reliability of Econometric Software

B. D. MCCULLOUGH and H. D. VINOD 1

1. Introduction

Numerical software is central to our computerized society; it is used . . . to analyze future options for financial markets and the economy. It is essential that it be of high quality; fast, accurate, reliable, easily moved to new machines, and easy to use. (Ford and Rice 1994)

APART FROM COST considerations, economists generally choose their software by its user-friendliness or for specialized features. They rarely worry whether the answer provided by the software is correct (i.e., whether the software is reliable). The economist, whose degree is not in computer science, can hardly be faulted for this: is it not the job of the software developer to ensure reliability? Caveat emptor. Would a reviewer notice if the software is inaccurate? We think not. We surveyed five journals that regularly publish reviews of econometric software. For the years 1990–97, over 120 reviews appeared. All but three paid no attention to numerical accuracy, and only two applied more than a single test of numerical accuracy (Michael Veall 1991, and McCullough 1997, but see also Vinod 1989 and Colin McKenzie 1998). Since computation is the raison d'être of an econometric package, this lacuna is all the more puzzling given the failure of many statistical packages to pass even rudimentary benchmarks for numerical accuracy (James Lesage and Stephen Simon 1985; Bernhard, Herbold, and Meyers 1988; Günther Sawitzki 1994b; Udo Bankhofer and Andreas Hilbert 1997). One would think that similar assessments of econometric software have been conducted, but the economics profession has no history of benchmarking econometric software.

1.1 Econometric Software Has Bugs

Consider full information maximum likelihood (FIML) estimation of Klein's Model I using Klein's original data. The consumption (Ct), investment (It), and wage (Wt) equations are given by

Ct = α0 + α1(Wt^p + Wt^g) + α2 Pt + α3 Pt−1

It = β0 + β1 Pt + β2 Pt−1 + β3 Kt−1

Wt = γ0 + γ1 Et + γ2 Et−1 + γ3 (t − 1931)

1 McCullough: Federal Communications Commission. Vinod: Fordham University. For comments and useful suggestions, thanks to Reginald Beardsley, Robert Cavazos, Francisco Cribari, Douglas Dacy, Jerry Duvall, William Greene, David Kendrick, Robert Kieschnick, David Letson, Charles Renfro, and Janet Rogers. We are especially indebted to Frank Wolak and two referees, who made substantial contributions. The views expressed herein are the authors', and not necessarily those of the Commission. Email: [email protected] and [email protected]


William Greene (1997, p. 760), Ernst Berndt (1990, p. 553), and Giorgio Calzolari and Lorenzo Panattoni (1988) present FIML parameter estimates and standard errors as found in Table 1. Note that the magnitude, significance, and even sign of the parameter estimates differ. Which set of estimates, if any, is correct? How much faith can be placed in the FIML estimates reported in journal articles?

As another example, Michael Lovell and David Selover (1994) attempted to fit a Cochrane-Orcutt AR(1) correction to three data sets by means of four different packages. For the first data set, estimates of ρ ranged from 0.36 to –0.79, with slope estimates for the parameter of interest ranging from –35.1 to –30.96. For the second data set, ρ estimates ranged from 0.31 to 0.93, with slope estimates ranging from –2.83 to 3.06. For the third data set, ρ estimates ranged from 0.84 to 1.001, with slope estimates ranging from 0.26 to 0.92. Sometimes these differences could be traced to differences in algorithms, in which case the differences are acceptable. Other times, they could not. Paul Newbold, Christos Agiakloglou and John Miller (1994) used 15 packages to fit ARMA models to five separate time series, with similar results. They also documented cases in which two packages generate the same parameter estimates, but markedly different forecasts. This has important implications for estimates of the long-run persistence of macroeconomic shocks (i.e., cumulative impulse response functions), as in John Campbell and N. Gregory Mankiw (1987).

TABLE 1
VARIOUS FIML RESULTS FOR KLEIN'S MODEL I
(standard errors in parentheses, asterisk denotes 5% significance)

            α0          α1          α2          α3
Berndt      17.165*     0.791*     −0.062       0.310
            (7.363)     (0.066)    (1.09)      (0.629)
Greene      17.8*       0.853*     −0.214       0.351*
            (2.12)      (0.047)    (0.096)     (0.101)
C & P       18.34*      0.8018*    −0.2324      0.3857
            (2.485)     (0.0359)   (0.3120)    (0.2174)

            β0          β1          β2          β3
Berndt      29.837     −0.625       1.020      −0.173
            (23.77)     (1.48)     (0.999)     (0.106)
Greene      17.2*       0.130       0.140      −0.136*
            (6.47)      (0.137)    (0.613)     (0.038)
C & P       27.26*     −0.8010      1.052*     −0.1481*
            (7.938)     (0.4914)   (0.3525)    (0.0299)

            γ0          γ1          γ2          γ3
Berndt      4.790       0.278*      0.257*      0.213*
            (4.82)      (0.084)    (0.052)     (0.078)
Greene      1.41        0.498*      0.087*      0.403*
            (0.943)     (0.018)    (0.015)     (0.021)
C & P       5.794*      0.2341*     0.2847*     0.2348*
            (1.804)     (0.0488)   (0.0452)    (0.0345)


Such important results might be quite sensitive to the choice of software, a possibility overlooked in textbooks.

When discussing the solution to nonlinear estimation problems, textbooks invariably mention that for a given problem, one algorithm might yield a solution, while another algorithm might fail to produce a solution. Just as invariably, no mention is made that when both algorithms solve the problem, one solution is likely to be more accurate than the other. As our FIML example makes clear, there is a distinct possibility that neither "solution" is correct, but this issue is never raised. Neither do textbooks warn students that even simple linear procedures, such as calculation of the correlation coefficient, can be horrendously inaccurate.

Suppose a multiple regression produced a high R² with low t-statistics. A first thought might be "multicollinearity." Standard practice in this situation is to compute the correlation matrix of the independent variables. Any conclusion regarding the correlation of the independent variables would be critically dependent upon the package used. Leland Wilkinson (1985, test II-D) has a test for the accuracy of computing a correlation matrix. His six variables (X, BIG, LITTLE, HUGE, TINY, and ROUND) all are linear transformations of each other, and so are perfectly correlated. Therefore the correlation matrix should be all units. The standard deviation of each variable should be 2.738 times some (possibly negative) power of ten. In Table 2 the output of four popular econometric packages is presented.2 While package "X1" gives the correct answer, programs "X2" and "X3" do not. Package "X4" even manages to report correlation coefficients in excess of unity! We also note that programs "X2" and "X4" do not correctly calculate all the standard deviations for the variables in question.
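The flavor of Wilkinson's test is easy to convey with a few lines of Python. The sketch below builds six variables in the spirit of his X, BIG, LITTLE, HUGE, TINY, and ROUND (the published test defines the exact values; these are our reconstruction) and asks NumPy for the correlation matrix and the standard deviations, which should come back as all ones and as 2.738... times powers of ten.

    import numpy as np

    # Variables in the spirit of Wilkinson's test II-D: each is a linear
    # transformation of X, so every pairwise correlation should equal one.
    X      = np.arange(1.0, 10.0)          # 1, 2, ..., 9
    BIG    = X + 99999990.0
    LITTLE = X * 1.0e-8 + 0.99999990
    HUGE   = X * 1.0e12
    TINY   = X * 1.0e-12
    ROUND  = X - 0.5

    data = np.column_stack([X, BIG, LITTLE, HUGE, TINY, ROUND])
    print(np.corrcoef(data, rowvar=False))    # should be a matrix of ones
    print(data.std(axis=0, ddof=1))           # should be 2.738... times powers of ten

A package whose correlation routine relies on a numerically careless formula can fail even this entry-level exercise, exactly as packages "X2" through "X4" do in Table 2.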

1.2 Why Has No One Noticed?

It is understandable that economists have paid little attention to whether or not econometric software is accurate. Until recently, econometrics texts rarely discussed computational aspects of solving econometric problems (but see Russell Davidson and James MacKinnon 1993, §1.5; Greene 1997, §5.2). Many textbooks convey the impression that all one has to do is use a computer to solve the problem, the implicit and unwarranted assumptions being that the computer's solution is accurate and that one software package is as good as any other. Statisticians are more likely to be versed in statistical computing and numerical analysis and to know that accuracy cannot be taken for granted. Yet even reviews of statistical software pay little attention to accuracy. In some quarters, complaints are raised against purveyors of inaccurate software, but as economists we take a more sanguine view. While the purveyance of inaccurate software is perhaps regrettable, it is predictable: the market provides us not necessarily with what we need, but with what we want; and we want speed, user-friendliness, and the latest econometric features.

The market forces that militate against the supply of accurate software are twofold.

2 We have elected not to identify software packages by name for two reasons. First, we regard published software reviews as a more suitable vehicle for providing full and fair assessments of individual packages. Second, some developers are remarkably quick to respond to reports of errors, and many of the errors we recount were fixed even before this article went to press.


First, how often do advertisements for econometric software feature "speed of solution"3 as opposed to "accuracy of solution"? Frequently there is a trade-off between speed and accuracy, and often the fastest way to compute is not the most accurate way to compute. William Kahan (1997) has noted the deleterious effects of this overarching emphasis on speed at the expense of accuracy, and in a field where one might hope that consumers would know better: computer science. If the same affliction bedevils the profession of computer science, the economics profession can hardly be faulted. Second, not only consumers' emphasis on speed, but consumers' emphasis on new features militates against more accurate econometric software. Much of software development is the incorporation of the latest econometric procedures. Designing and testing software for accuracy is extremely labor intensive, and the developer who seeks to ensure accuracy may have to delay implementation of new features. Thus, an accuracy-enhanced upgrade would lack features possessed by other products.

TABLE 2
MATRICES OF CORRELATION COEFFICIENTS

          X      BIG    LITTLE  HUGE   TINY   ROUND   st. dev.

X         1.0                                         2.738
BIG       1.0    1.0                                  2.738
LITTLE    1.0    1.0    1.0                           2.738E−8
HUGE      1.0    1.0    1.0     1.0                   2.738E+12
TINY      1.0    1.0    1.0     1.0    1.0            2.738E−12
ROUND     1.0    1.0    1.0     1.0    1.0    1.0     2.738

"Package X1" (correct answer)

X         1.0                                         2.738
BIG       0.867  1.0                                  4.216
LITTLE    0.863  0.614  1.0                           3.518E−8
HUGE      1.0    0.866  0.836   1.0                   2.738E+12
TINY      1.0    0.866  0.836   1.0    0.0            2.738E−12
ROUND     1.0    0.866  0.836   1.0    1.0    1.0     2.738

"Package X2"

X         0.999                                       2.738
BIG       0.645  1.0                                  2.738
LITTLE    0.819  0.577  1.0                           2.738E−8
HUGE      1.0    0.645  0.820   1.0                   2.738E+12
TINY      1.0    0.645  0.820   1.0    1.0            2.738E−12
ROUND     0.999  0.645  0.820   1.0    1.0    0.999   2.738

"Package X3"

X         1.0                                         2.738
BIG       1.129  1.127                                2.424
LITTLE    1.007  1.137  1.013                         2.87E−8
HUGE      1.0    1.130  1.007   1.0                   2.738E+12
TINY      1.0    1.130  1.007   1.0    0.0            2.738E−12
ROUND     1.0    1.130  1.007   1.0    1.0    1.0     2.738

"Package X4"

3 Some common measures of "speed" as found in advertisements and software reviews are misleading, such as "the number of seconds required to invert a 100x100 matrix 1000 times." Such looping procedures can exaggerate the influence of the cache dramatically, especially when the cache hit rate reaches 100 percent after the first loop. See Weicker (1984).


In a market that demands speed and new features with little emphasis on accuracy, the developer who attempts to enhance accuracy could lose out to the competition. When numerical accuracy is not a criterion for evaluation (e.g., Jeffrey MacKie-Mason 1992), a clear signal is sent to software developers that allocating resources toward numerical accuracy is not a profit-maximizing strategy. Developers are merely satisfying demand: if we do not demand some essential feature like numerical accuracy, it is no one's fault but our own.

We believe that if the consumers of econometric software were aware of the extent of numerical inaccuracies in econometric software, developers would have the incentive to spend more time ensuring accuracy, and could do so without losing market share. To this end, then, a goal of this article is to show consumers of econometric software that accuracy cannot be taken for granted, and that conversations about econometric software should begin with accuracy, and only then turn to speed and user-friendliness. This, in turn, will provide developers the incentive to supply that accuracy.

1.3 Benchmarking

Thirty years ago James Longley (1967) worked out by hand the solution to a regression of "total employment" on GNP, the GNP deflator, Unemployment, Size of the Armed Forces, Population, and Time, for the sixteen years 1947–62, and he did so to ten significant digits. He compared these results with those from a variety of mainframe regression packages and discovered that most programs produced drastically incorrect answers: "With identical inputs, all except four programs produced outputs which differed from each other in every digit" (Longley 1967, p. 822). One program gave one digit of accuracy, two gave four-digit accuracy, and another gave either zero or one-digit accuracy for each coefficient. Longley traced the source of many failures to poor choices of algorithms. While Longley wrote thirty years ago, the lesson learned remains: software reliability cannot be taken for granted.

The statistical literature has a long history of concern for software reliability (Ivor Francis, Richard Heiberger, and Paul Velleman 1975; Albert Beaton, Donald Rubin, and John Barone 1976; Francis 1981, 1983; Eddy et al. 1981), and this concern has produced many benchmarks. In addition to the Longley Benchmark, Roy Wampler (1980) proposed an entire suite of linear regression benchmarks. Lesage and Simon (1985) and Simon and Lesage (1988) constructed benchmark tests for univariate summary statistics and the analysis of variance. Finally, Alan Elliott, Joan Reisch and Nancy Campbell (1989) and P. Lachenbruch (1983) provided benchmarks for elementary statistical software packages. This trend most recently culminated in Sawitzki (1994), who proposed a testing strategy for assessing the reliability of statistical software. Recognizing that stringent testing is worthwhile only after entry-level tests are passed, he proposed L. Wilkinson's (1985) tests as a set of minimal standards. Sawitzki (1994a), Wilkinson (1994), and Bankhofer and Hilbert (1997) applied the Wilkinson tests to several well-known statistical packages, and the results were less than impressive: all products failed some of these entry-level tests. McCullough (1999b) applied the Wilkinson Tests to several econometric packages, which fared about as well as the statistical packages did.


That is to say, there is definite room for improvement in the numerical accuracy of econometric software.

With respect to accuracy and economics, a common sentiment is that economic data are accurate only to a few digits, and more accuracy than that is not necessary, so worrying about 10 digits of accuracy in econometric calculations is pointless. We partially agree with this view, but make an important distinction. As expressed above, this sentiment conflates how output is calculated and to what end the output is used. A better expression is: since economic data are accurate only to a few digits, reporting 10 digits of a final answer is pointless; however, all intermediate calculations should be carried out to as many digits as possible. The dichotomy between calculation of output and use of output is important. It may well be the case that there is no practical difference between two packages' estimates of a correlation coefficient, ρ1 = 0.95 and ρ2 = 0.92. However, if the exact result is 0.95 and a good implementation of a good algorithm will achieve that result, then package two's correlation routine is atrocious. This is because it produces only one accurate digit in a setting where even a fair implementation of a fair algorithm will produce seven digits of accuracy in double precision. Specifically, ρ2 = 0.92 is evidence of bad software.

Some readers might be dismayed at becoming aware of the extent of inaccuracies in econometric software, and their confidence in the reliability of numerical computation might be shaken. The inaccuracies we recount have long existed in all manner of computational software, econometric and other. Only recently have the tools for diagnosing and remedying many of these deficiencies become available, so only recently are users and developers becoming more aware of these problems. The foreword to a recent text on the subject (Francoise Chaitin-Chatelin and Valérie Frayssé 1996) addresses this precise point: "In some sense, that awareness [of inaccuracy] is no bad thing, so long as the positive aspects of the understanding of finite precision computation are appreciated."

Before finite precision computation can be appreciated, some acquaintance with how computers handle numbers is necessary. Therefore, Section Two discusses computer arithmetic and errors in computation. It emphasizes: (1) that two algebraically equivalent methods may have drastically different effects when implemented on a computer; and (2) that "small" differences in the input or the algorithm can produce "large" changes in output. Section Three documents inaccuracies in existing econometric software, and suggests useful benchmark collections for testing general routines. Section Four argues that specialized routines, specific to economics, need benchmarks. Section Five discusses random number generators and how to test them. Section Six discusses statistical distributions (e.g., for calculating p-values) and how to test them. Section Seven offers the conclusions.

2. Computers and Software

Computers are exceedingly precise and can make mistakes with exquisite precision. Certain tasks which are frequently repeated, such as rounding a sum, need to be exceedingly accurate. Consider rounding to three decimals a number whose magnitude is about one thousand. Should 1000.0006 be rounded up to 1000.001 or rounded down to 1000.000? Perhaps the answer is obvious, but what if the number in question was 1000.0005?


Both rounding up and rounding down will inject a bias into the final result, so perhaps the answer is not so obvious.4
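The two rounding rules are easy to compare with Python's decimal module, which works in exact decimal arithmetic and so keeps the binary issues discussed below out of the picture; the sketch also illustrates the round-to-even rule of footnote 4.

    from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

    x = Decimal("1000.0005")                 # exactly representable in decimal
    print(x.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP))    # 1000.001
    print(x.quantize(Decimal("0.001"), rounding=ROUND_HALF_EVEN))  # 1000.000

    # The round-to-even examples of footnote 4:
    print(Decimal("1.005").quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN))  # 1.00
    print(Decimal("1.015").quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN))  # 1.02

Because ties are sent up half the time and down the other half, round-to-even avoids the systematic drift that always rounding halves upward would inject into a long running sum.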

Improper attention to the method of rounding can produce disastrous results. The Wall Street Journal (November 8, 1983, p. 37) reported on the Vancouver Stock Exchange, which created an index much like the Dow-Jones Index. It began with a nominal value of 1,000.000 and was recalculated after each recorded transaction by calculation to four decimal places, the last place being truncated so that three decimal places were reported. Truncating the fourth decimal of a number measured to approximately 10^3 might seem innocuous. Yet, within a few months the index had fallen to 520, while there was no general downturn in economic activity. The problem, of course, was insufficient attention given to the method of rounding. When recalculated properly, the index was found to be 1098.892 (Toronto Star, November 29, 1983).

2.1 Computer Arithmetic

To a certain degree, all computers produce "incorrect" answers due to the computer's lack of an infinite word length to store numbers. A computer's arithmetic is different from the one people apply with paper and pencil. While people calculate using decimal (base-10) representations for numbers, computers calculate using base-2. As an example, the decimal number 23 has the base-2 representation 10111 since 1(2^4) + 0(2^3) + 1(2^2) + 1(2^1) + 1(2^0) = 23, which is denoted φ2(23) = 10111. Similarly, φ2(10) = 1010 and the decimal 0.5 has an exact base-2 representation φ2(0.5) = 0.1 since 0.5 = 1(2^−1). The decimal 0.1 has an infinite (but periodic) binary representation with period four, φ2(0.1) = 0.0001100110011..., where the four-digit block 0011 repeats without end. However, a computer has finite storage. If it has 23 bits of storage to the right of the decimal, it will hold the decimal 0.1 in memory as the binary number φ̂2(0.1) = 0.00011001100110011001100110, where φ̂2(⋅) denotes the stored version of φ2(⋅). Since φ2(0.1) is an infinitely repeating number, the stored version φ̂2(0.1) is not exactly equal to the decimal 0.1 it represents. If the stored binary number φ̂2(0.1) is reconverted to decimal, it becomes 0.09999999403953. Thus, the computer "sees" 0.1 as something slightly less than 0.1. This has some interesting implications. First, it implies that rescaling a number by 10 can cause a loss of precision, since the exponent is stored base-2 rather than base-10. Second, reading data and performing a units conversion is different from reading already-converted data, a most discomfiting situation for those who encounter it. These problems arise not due to the use of base-2 per se, but due to the combination of base-2 and finite precision.
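This is easy to see directly. The minimal check below converts the stored single- and double-precision versions of 0.1 back into exact decimals. (The text's illustration truncates the binary expansion, so its stored value falls just below 0.1; IEEE hardware rounds to nearest, so the stored value can land just above 0.1 instead. Either way, it is not one tenth.)

    import numpy as np
    from decimal import Decimal

    # Exact decimal value of the single-precision number stored for 0.1:
    print(Decimal(float(np.float32(0.1))))   # 0.100000001490116...
    # Exact decimal value of the double-precision number stored for 0.1:
    print(Decimal(0.1))                      # 0.100000000000000005551...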

A floating point binary number has a fixed number of places, say 24, with a decimal point which can be placed anywhere. For example, φ2(10) = 00000000000000000001010.0, φ2(0.1) = 0.00011001100110011001100, and the base-2 representation of their sum is φ2(10.1) = 1010.0001100110011001100. When this sum is reconverted to decimal, it becomes 10.0999985. A real number is represented in the floating point format s × M × B^E, where s is the sign (zero for positive or unity for negative), B is the base of the representation (usually 2), E is the exponent, and M is a positive integer mantissa. A useful way to view this is as M being a string of zeroes and ones, with B^E placing a decimal point somewhere in the string

4 A common solution is to round to the nearest even digit, e.g., 1.005 becomes 1.00 while 1.015 becomes 1.02. This scheme is called "round-to-even."


(or to the left or right of the string, implicitly padding with zeroes). If the leading digit is non-zero, the number is said to be normalized. The representations of φ2(10) and φ2(0.1) above are not normalized, whereas that of φ2(10.1) is. While under certain circumstances intermediate calculations might involve non-normalized numbers, results are stored in normalized format. Since the first digit in normalized format is always unity, we get an extra bit for the mantissa, referred to as the "hidden bit."5

For single-precision, a 32-bit word might be partitioned as follows: one bit for the sign, eight bits for the exponent (which can take integer values from –126 to 127) and 23 bits for the mantissa which, with the hidden bit, yields a 24-bit mantissa, sometimes referred to as "23 + 1" to indicate that the 24th bit comes from the hidden bit. The magnitudes of such single-precision floating point numbers are constrained to lie between 2^−126 ≈ 1.2 × 10^−38 and (2 − 2^−23) × 2^127 ≈ 3.4 × 10^38, the range of representable numbers, with the underflow threshold and overflow threshold as lower and upper limits, respectively.

Let R be the familiar real number line, and let F be the representable numbers. Let fl(x) be the floating point representation of x. That is, fl(x) is the number in F that corresponds to the number x ∈ R. For example, if x = 0.1 then fl(x) = 0.09999999403953. Thus it is not always true that x = fl(x), i.e., that x ∈ R ⇒ x ∈ F. Even when x, y ∈ F it does not follow that (x + y) ∈ F. This is a first suggestion that computer arithmetic is different from pencil-and-paper arithmetic. Let us define floating point addition by fl(x + y). Ideally, fl(x + y) is the number in F that is closest to x + y ∈ R. The difference, fl(x + y) − (x + y), is rounding error, and this difference depends in part on the values of x and y since the representable numbers are not uniformly distributed in base-10. In single precision, there are 8,388,607 floating point numbers between 1 and 2, while between 1023.0 and 1024.0 there are 8,191 floating point numbers. Thus, it can be expected that numbers with larger magnitudes are more susceptible to rounding error than numbers with smaller magnitudes; this supports the recentering and scaling of numbers to mitigate the adverse effects of floating point arithmetic, as is frequently suggested in discussions of collinear data. While frequently it is true that (x + y) ∈ F, much less frequently is it true that (x ⋅ y) ∈ F for floating point multiplication, since the product involves 2s or 2s − 1 significant digits, which extend beyond the s-digit mantissa. Therefore, floating point multiplication is much more likely to be contaminated by rounding error than floating point addition. All this implies that computer arithmetic, contrary to the familiar pencil-and-paper arithmetic, is neither associative nor distributive, though it is commutative. One consequence of this, shown in subsection 2.3, is that absent special precautions, it is possible for the logical statement (x = y) to evaluate "false" while (x − y = 0) evaluates "true." In such a case, logical tests of equality cannot be trusted.
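The failure of associativity is visible with one line of code in any language that uses IEEE double precision; merely regrouping three ordinary decimals changes the computed sum.

    a, b, c = 0.1, 0.2, 0.3
    print((a + b) + c == a + (b + c))   # False under IEEE double precision
    print((a + b) + c, a + (b + c))     # typically 0.6000000000000001 versus 0.6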

Let ⊕ denote any of the four arithmetic operations + − × ÷ and let fl(x ⊕ y) be the floating point representation of x ⊕ y. Then machine precision, ε, is the smallest value satisfying

|fl(x ⊕ y) − (x ⊕ y)| ≤ ε |x ⊕ y|     (1)

i.e., it is the smallest value for which (1) holds for all ⊕ and for all x and y such that the magnitude of x ⊕ y is neither greater than the overflow threshold nor less than the underflow threshold.

5 Not all computers use the hidden bit, but it is part of IEEE-754.


For standard single precision, machine epsilon is εS = 2^−24 ≈ 5.96 × 10^−8 and for double precision, which uses a 64-bit wordlength, it is εD = 2^−53 ≈ 1.11 × 10^−16. Accuracy refers to the error of an approximation, whereas precision refers to the accuracy with which basic arithmetic operations are performed. They are the same for scalar computation, e.g., c = a ⋅ b, but precision can be better than accuracy for nonscalar operations such as matrix inversion, since precision refers to the result of a single calculation, while accuracy can refer to the result of several calculations.
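Machine epsilon can be measured rather than looked up. The sketch below halves a candidate epsilon until adding it to one no longer changes the stored result. What the loop finds is the spacing between 1 and the next representable number (2^−23 in single precision, 2^−52 in double); the bound ε in equation (1) is half of that spacing, because results are rounded to the nearest representable number.

    import numpy as np

    def spacing_at_one(one):
        # Halve eps until 1 + eps/2 is indistinguishable from 1 in this precision.
        eps = one
        while one + eps / 2 > one:
            eps = eps / 2
        return eps

    print(spacing_at_one(np.float32(1.0)))   # about 1.19e-07 = 2**-23, so eps_S = 2**-24
    print(spacing_at_one(np.float64(1.0)))   # about 2.22e-16 = 2**-52, so eps_D = 2**-53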

2.2 Errors in Computation

When two floating point numbers are added, the smaller (in magnitude) number is right-shifted until the exponents of the two numbers are equal, and then added (this procedure is referred to as normalization). While this method does not waste any bits of the mantissa and hence preserves accuracy, the least significant bits of the smaller number are lost when it is right-shifted. This is an example of roundoff error or rounding error. Roundoff error is a function of hardware and exists because a computer has a finite number of significant digits with which to represent real numbers. For given numbers x, y, and z, when the computer is asked to calculate xy + z what it actually returns is w = (xy(1 + α) + z)(1 + β), where α and β are rounding errors. Usually there are bounds for the rounding errors, such as |α| < 2^−53 and |β| < 2^−53, and these can be used to bound the total error between the computed w and the actual value of xy + z. Such bounds are necessary because clearly w ≠ xy + z due to rounding error, and in fact E[w] ≠ xy + z. Unknown distributional forms and correlations hamper statistical analysis of rounding errors.6 It makes sense that there are only bounds: if α and β were actually known, they could be subtracted off to yield an exact result.

Even when two numbers are precise to several digits, a single arithmetic operation can introduce sufficient error to ensure that the result is precise to no significant digits. One such occasion is the special case of rounding error called cancellation error, which occurs when two nearly equal numbers are subtracted. Nicholas Higham (1996, p. 10) provides an illustrative example. For f(x) = (1 − cos x)/x², let x = 1.2 × 10^−5. To ten significant digits, cos x = 0.9999999999, so 1 − cos x = 0.0000000001 and f(x) = 0.6944 . . . , though by the definition of cos x the following inequality is true: 0 ≤ f(x) ≤ 0.5 for all x. The subtraction is exact, but the error in the sole nonzero digit in 0.0000000001 is of the same order as the true answer, and thus the truth is not visible in the final answer. Determining when such adverse results can and cannot happen is the field of error analysis: what proportion of the final answer is truth and what proportion is error. Underlying the notion of error analysis is the idea of how accurately each calculation is performed, and how the error from each calculation propagates through subsequent calculations.
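The same cancellation is visible in single precision, where it is even harsher than in Higham's ten-digit decimal example. The sketch below evaluates the formula as written and then an algebraically identical form based on the identity 1 − cos x = 2 sin²(x/2), which avoids subtracting nearly equal numbers.

    import numpy as np

    x = np.float32(1.2e-5)

    naive  = (np.float32(1.0) - np.cos(x)) / (x * x)          # subtracts nearly equal numbers
    stable = np.float32(0.5) * (np.sin(x / 2) / (x / 2))**2   # algebraically the same function

    print(naive, stable)   # the naive form loses every digit (here it returns 0.0);
                           # the stable form returns 0.5, correct to single precision

The moral is the one the text draws: the subtraction itself is exact, but it promotes earlier rounding error in cos x into the leading digits of the answer.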

Even if computers had an infinite number of significant digits and so had no roundoff error, another type of error would exist as a function of software: truncation error.

6 Make the simplifying assumption that rounding errors are independent of x, y and z (they are not, because the floating point numbers are not uniformly distributed; the rounding error is smaller when the magnitude of the number is smaller, as seen in Section 2.1). Trivially, if E[α] = E[β] = 0 then E[w] = xy + z + xyE[αβ], and E[αβ] ≠ 0 because rounding errors are neither independent nor uncorrelated. Asymptotic theory for dependent and correlated sequences is of little help, because rounding errors routinely violate the Lindeberg condition: often a few rounding errors dominate the final error.


It exists because a program uses finite term approximations. The analysis of truncation error can be said to constitute much of the field of numerical analysis. Consider a power series in x whose infinite sum is sin(x). A computer will truncate the infinite sum by including only a finite number of terms, and in doing so will commit a truncation error. We note in passing that some power series for sin(x) converge quickly while others do not, thus the choice of algorithm can be crucial. Also, the rate of convergence can depend upon x, converging faster for some values than for others. Indeed, even the order in which summation is undertaken can affect the quality of the result. Germund Dahlquist and Ake Bjorck (1974) showed that in calculating Σ_{n=1}^{10,000} n^−2, reversing the order leads to an error 650 times smaller than summing over increasing n (see also Higham 1996, §4.2–4.5). This is because rounding error accumulates more slowly when small terms are added first. Summing a very large number of very small quantities is not just an exercise for testing software; numerical integration plays an important role in econometrics. Numerical integration becomes particularly difficult when higher moments are involved, because the numerical error becomes more severe. See Vinod and Shenton (1996) for a discussion.
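The effect of summation order is easy to reproduce. A minimal sketch in single precision follows; the factor of improvement depends on the precision used and will not be the 650 of the Dahlquist and Bjorck example, but the direction is the same.

    import numpy as np

    n = np.arange(1, 10001, dtype=np.float64)
    terms = (1.0 / n**2).astype(np.float32)      # each summand correct to single precision

    forward = np.float32(0.0)
    for t in terms:                              # largest terms first
        forward = forward + t

    backward = np.float32(0.0)
    for t in terms[::-1]:                        # smallest terms first
        backward = backward + t

    reference = np.sum(1.0 / n**2)               # double-precision reference value
    print(abs(forward - reference), abs(backward - reference))
    # the error of the backward (small-terms-first) sum is far smaller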

Whether via truncation or roundoff, error is introduced into most any result of an arithmetic operation. As a general rule, successive errors do not offset each other; they accumulate, and this cumulative error is an increasing function of the number of operations. Generally, the total error is of order Nε, where N is the number of operations. In the special case that errors tend to be of opposite sign, the cumulative effect of such errors does not disappear, but produces a total error of order √N ε. We note that there are multiple precision arithmetic routines available, which can carry out calculations to 500 digits, thus effectively eliminating roundoff error for many types of problems. However, they are quite specialized and not generally used by economists.

Two methods of solving the same problem can differ dramatically in the number of operations. Consider solving n equations in n unknowns. We all know how to solve such a problem using Cramer's Rule. It can be shown that the number of multiplications and divisions necessary to solve the system is (n² − 1)n! + n. The method of Gaussian elimination requires only (n/6)(2n² + 9n − 5) such operations. For n = 5 Cramer's Rule requires 2885 multiplications whereas Gaussian elimination requires only 75. With noncancelling errors and machine precision 2^−22, the orders of the approximations are 0.0006874 and 0.00001788, respectively. However, if n = 10, Gaussian elimination requires 475 operations with error order 0.0001132 while Cramer's rule requires approximately 360 million operations with error order 85.831. Clearly, systems of equations should be solved by Gaussian elimination rather than Cramer's rule (though for econometric work, other methods are preferred to Gaussian elimination). Prima facie it is intuitively desirable to employ methods with fewer operations, not only for the sake of speed but also for the sake of accuracy. This is not always so: sometimes more operations are desired not for sake of speed but for sake of accuracy, e.g., in the calculation of the sample variance (see subsection 2.5). Moreover, not only are some algorithms preferable to others, at an even more fundamental level, some methods of performing calculations are preferable to others.
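The operation counts and resulting error orders are simple enough to tabulate directly; the short script below reproduces, to the accuracy of the quoted digits, the figures above under the same assumption of noncancelling errors and machine precision 2^−22.

    from math import factorial

    def cramer_ops(n):
        # multiplications and divisions needed by Cramer's Rule
        return (n**2 - 1) * factorial(n) + n

    def gauss_ops(n):
        # the corresponding count for Gaussian elimination
        return n * (2 * n**2 + 9 * n - 5) // 6

    eps = 2.0**-22
    for n in (5, 10):
        print(n, cramer_ops(n), gauss_ops(n), cramer_ops(n) * eps, gauss_ops(n) * eps)
    # n = 5 : 2885 and 75 operations, error orders about 6.9e-4 and 1.8e-5
    # n = 10: roughly 3.6e8 and 475 operations, error orders about 86 and 1.1e-4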


When a calculation produces a number that is too large, the result is overflow, and a number that is too small produces underflow. To see overflow, on a handheld calculator with eight digits and without scientific notation, square 99,999,999. Dividing unity by 99,999,999 yields underflow. Overflow is a very stark process, usually resulting in noticeable program failure. Underflow can be pernicious, because when handled improperly it can result in sensible-looking answers that are completely inaccurate. One method of handling underflow is known as abrupt underflow (a.u.), in which numbers that are sufficiently small but nonzero often are automatically set to zero. One unfortunate consequence of abrupt underflow is that x − y = 0 does not imply x = y (David Goldberg 1991), and logical tests for equality cannot be trusted. To see how this can happen, it is easiest to use decimal arithmetic rather than base-2.

Suppose, then, that the base is 10, the mantissa has three digits, and the exponent ranges from –16 to +15, so that the smallest representable floating point number, say fpmin, is 1.00 × 10^−16. Consider two numbers, x = 5.28 × 10^−15 and y = 5.23 × 10^−15, both comfortably larger than fpmin. A test of x = y will return false. Paradoxically, a test of x − y = 0 will return true. The reason for this is that x − y = 0.05 × 10^−15 = 5.0 × 10^−17 when normalized, which is smaller than fpmin and hence is set flush to zero, i.e., subjected to abrupt underflow (a.u.). In the presence of a.u., no theorems requiring x − y = 0 ⇔ x = y can be used to prove anything, since this relation is then only sometimes true, not always true.
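Because IEEE hardware uses gradual underflow, this behavior has to be simulated to be seen on a modern machine; the toy sketch below imposes abrupt underflow at the system's fpmin by hand and reproduces the paradox.

    FPMIN = 1.00e-16    # smallest representable magnitude in the toy three-digit system

    def flush(z):
        # abrupt underflow: any nonzero magnitude below FPMIN is set to zero
        return 0.0 if 0 < abs(z) < FPMIN else z

    x = 5.28e-15
    y = 5.23e-15
    print(x == y)                  # False
    print(flush(x - y) == 0.0)     # True: 5.0e-17 is below FPMIN and gets flushed

Under gradual underflow the difference would instead be stored as a subnormal number, and the two tests would agree.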

As a means of handling the inaccuracy of a.u., I. Goldberg (1967) proposed the method of gradual underflow (g.u.). Continuing the above example, when the exponent is at its minimum, –16, fpmin no longer is the smallest representable number, because 0.99 × 10^−16 is smaller than fpmin. Note, however, that 0.99 × 10^−16 is not normalized, so such floating point numbers are referred to as subnormals. Subnormal numbers are part of IEEE-754 (IEEE 1985), the rules for computer arithmetic on which hardware has standardized.7 Virtually all PC and workstation hardware supports the use of subnormal numbers, though some software does not take advantage of this feature. Prior to Demmel (1981), it was commonly thought that a.u. is "harmless" and that the choice of either a.u. or g.u. was innocuous. One of the advantages of IEEE-754 is that it supports g.u., which is a necessary condition for x − y = 0 ⇔ x = y. Perhaps more importantly, g.u. achieves greater numerical reliability for solving linear systems of equations than a.u. does.

7 As of this writing, some conforming computers are: PCs based on Intel 386, 387, 486, Pentium, and P6 processors and associated clones by Cyrix, IBM, AMD, and TI; Macintosh based on Motorola 68020 + 68881/2 or 68040; IBM RS/6000; PowerPC based PCs and Macintoshes; Suns based on M68020 + 68881/2 or SPARC chips; DEC Alpha based on DEC 21064 and 21164 chips; Cray T3D based on DEC 21064; and HP based on PA-RISC chips. Exceptions are: Cray X-MP, Y-MP, C90, J90; IBM /370 and 3090; and DEC VAX.

To see this, consider the LU decomposition of a matrix (another way to solve for the least squares coefficients), which produces a lower triangular (L) and upper triangular (U) factorization of a matrix A. Recall from matrix theory that if A is non-singular then the main diagonals of both L and U are non-zero. Demmel (1981) gives the following example. Let

A = λ [ 2  3 ]                                          (2)
      [ 1  2 ]

which is clearly well-conditioned for matrix inversion. Using g.u. produces


Lgu Ugu = [ 1    0 ] ⋅ λ [ 2   3  ]  =  A               (3)
          [ 1/2  1 ]     [ 0  1/2 ]

whereas a.u. yields

Lau Uau = [ 1    0 ] ⋅ λ [ 2   3 ]  ≠  A                (4)
          [ 1/2  1 ]     [ 0   0 ]

Thus, g.u. produces the correct factorization while a.u. incorrectly attributes singularity to the decidedly non-singular matrix A. It is known that a computer that does not support g.u. cannot satisfy some of the LAPACK (Linear Algebra PACKage) benchmarks. To the extent that nonlinear solvers are based on linear approximations, it can be expected that, ceteris paribus, g.u. is also better than a.u. for solving nonlinear equations.
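Demmel's example can be rerun on any machine that honors IEEE-754 gradual underflow. In the sketch below, λ is taken to be the smallest normalized double so that the pivot λ/2 is a subnormal number; the abrupt-underflow branch is then simulated by flushing that pivot to zero by hand, which is what a flush-to-zero system would do on its own.

    import numpy as np

    lam = np.finfo(float).tiny              # smallest normalized double, about 2.2e-308
    A = lam * np.array([[2.0, 3.0],
                        [1.0, 2.0]])

    # One step of Gaussian elimination written out for this 2x2 matrix:
    L = np.array([[1.0, 0.0],
                  [0.5, 1.0]])
    U = np.array([[A[0, 0], A[0, 1]],
                  [0.0, A[1, 1] - 0.5 * A[0, 1]]])   # U[1,1] = lam/2, a subnormal

    print(U[1, 1] != 0.0)                   # True: gradual underflow keeps lam/2 nonzero
    print(np.array_equal(L @ U, A))         # True: the factorization reproduces A exactly

    U_au = U.copy()
    U_au[1, 1] = 0.0                        # simulate abrupt underflow of the pivot
    print(np.array_equal(L @ U_au, A))      # False: A now appears singular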

Though many chips fully support IEEE-754 and many compilers and languages support some aspects of IEEE-754, no compiler (as of this writing) fully supports IEEE-754, nor does any language (though FORTRAN 90 is more 754-compliant than FORTRAN 77, and similarly for FORTRAN 95 with respect to FORTRAN 90).8 Nonetheless, there are many languages and compilers that support g.u., while there are many that do not. Further results on underflow and numerical reliability, in addition to more examples, can be found in Demmel (1984).

2.3 Misconceptions of Floating-Point Arithmetic

Having discussed some of the intricacies of computer arithmetic and floating point calculations, it is instructive to dispel what Higham (1996) calls "Misconceptions of Floating Point Arithmetic," of which we mention only three. We mention these not only to illuminate the discussion of floating-point arithmetic, but to drive home the fundamental point that computer math is not at all like pencil-and-paper math.9

Misconception Number One: A short computation that is free of cancellation error, overflow, and underflow must necessarily be accurate.

Consider the following six lines of code:

for i = 1:60
  x = x**0.5
end
for i = 1:60
  x = x**2
end

where x**0.5 ≡ √x and x**2 ≡ x². Note that this algorithm is free from cancellation error, underflow, and overflow. Calculated without error, this algorithm will return x for any non-negative x that is entered, i.e., this algorithm represents the function f(x) = x, x ≥ 0. However, as seen in the previous subsection, computers necessarily calculate with error, and so there is reason not to be surprised if the computed function, f̂(x), differs from the theoretical function f(x). On Brand X programming software, this algorithm actually computes not f(x) but

f̂1(x) = 0 if 0 ≤ x < 1;  f̂1(x) = 1 if x ≥ 1.     (5)

On Brand Y programming software a different answer is obtained:

f̂2(x) = 0 if x = 0;  f̂2(x) = 1 if x > 0.     (6)

Observe that f̂1 and f̂2 are completely different functions.

8 IEEE-754 was developed by hardware specialists; programming language and compiler specialists were not involved. Some aspects of IEEE-754 therefore are particularly difficult for compilers and languages to support, but much progress is being made.

9 See, for example, George Forsythe's (1970) aptly titled article, "Pitfalls in Computation, or Why a Math Book Isn't Enough," for an elementary discussion of why ideas which work in mathematical theory often fail in computational practice. A related article is L. Fox's (1971) "How To Get Meaningless Answers in Scientific Computation (and What To Do About It)."


Inputting the numbers 0.0, 0.1, 0.2, . . . , 1.0, f̂1 returns 10 zeroes followed by unity, while f̂2 returns a zero followed by 10 units. One might be tempted to think that, surely, one or both answers indicate "bad" software; in fact such a conclusion would be unwarranted. There is nothing wrong with the software. The problem is simply that the algorithm exhausts the computer's precision and range. A user should always have some idea of the software's precision and range, and whether his combination of algorithm and data will exhaust these limits.

As a trivial example, a user should not attempt to manipulate ten-digit numbers on a program that uses single precision storage.10 At the other extreme, a researcher with a hundred thousand observations might choose not to run a regression on a PC. With so many observations, the potential for disastrous cumulated rounding error is an evident concern. Least squares algorithms that are generally robust tend to be memory-intensive, and a PC might not have enough memory to solve a large system with such an algorithm. A less-robust least squares algorithm with less-demanding memory requirements might be able to produce a solution, but the solution might well be so contaminated with rounding error as to be completely unreliable.11

Misconception Number Two: Increasing the precision at which a computation is performed increases the accuracy of the answer.

Equation (1) shows that the bound on the error is proportional to machine epsilon. When the error bound is attained, if the same problem is solved in single precision and then again in double precision, the double precision error bound will be smaller than the single precision error bound by a factor of εD/εS = 2^−53/2^−24 ≈ 10^−9, i.e., the double precision answer will have approximately nine more decimal digits correct than the single precision answer. However, actually attaining the error bound is a low probability event, so there is no guarantee that a result computed in t digits of precision will be more accurate than a result computed in s digits of precision for t > s (Higham 1996, §1.13).12 The "convenient fiction" that increasing the precision necessarily increases the accuracy is just that; nor is it a pathological case that would never happen in practice.

As an example that this is more than a theoretical curiosum, Longley's results with an IBM 360 were more accurate in single precision than in double precision (see Table 10 of Longley 1967, p. 837). Had Longley not worked out the correct answer by hand, many persons naturally would assume that the double precision estimates were more accurate than the single precision estimates. When a user says that he encountered roundoff error in single precision and then switched to double precision to get a better answer, a better interpretation of what he really means is provided by William Press et al. (1992, p. 882), "For this particular algorithm, and my particular data, double precision seemed able to restore my erroneous belief in the 'convenient fiction'."

10 Some packages have "single precision storage" with "double precision calculation." The use of "single precision storage" can have adverse effects on accuracy. See McCullough (1999a).

11 This illustrates one of the many dilemmas confronting developers: how to balance the choice of algorithm against the need to handle large datasets. Further discussion of the developer's viewpoint can be found in Charles Renfro (1997).

12 J. H. Wilkinson (1963, pp. 25–26) describes the circumstances under which the bound is attained for floating-point multiplication. Not only must each individual error attain its maximum, but the distribution of the multiplicands must follow a special law. Taken together, these imply that attaining the bound is a low probability event. See also William Kennedy and James Gentle (1980, pp. 32–33).


All this is not to suggest that double precision should be abandoned in favor of single precision; far from it. The point is simply that, contrary to what intuition based on pencil-and-paper experience might tell us, double precision is not always more accurate than single precision.

Misconception Number Three: The final computed answer from an algorithm cannot be more accurate than any of the intermediate quantities; that is, errors cannot cancel.

Earlier we suggested that rounding errors do not cancel. However, that is a general principle, and a clever programmer can force an exception to the rule. We now demonstrate this notion. Consider f(x) = (e^x − 1)/x. The seemingly natural way to program this function is Algorithm 1:

if x = 0
  f = 1
else
  f = (e**x - 1)/x
end

but an alternative is Algorithm 2:

y = e**x
if y = 1
  f = 1
else
  f = (y-1) / ln(y)
end

If x = 9 × 10^−8 and finite precision u ≈ 6 × 10^−8, then a precisely calculated solution is f(x) = 1.00000005. Algorithm 1, the natural way to program the function, yields 1.32454766 while Algorithm 2 yields 1.00000006. The reason the natural method fails miserably is that the intermediate step of the calculation is imprecise: for x ≈ 0 the numerator is swamped by rounding and cancellation error. Again let a circumflex (^) denote a computed quantity. In Algorithm 2 it is true that the intermediate quantities ŷ − 1 and ln ŷ do not approximate y − 1 and ln y well for y ≈ 1. Nonetheless, (ŷ − 1)/ln ŷ is an extremely good approximation to (y − 1)/ln y in that range, because the latter varies slowly and in fact has a removable singularity at the point y = 1. This demonstrates that the "natural" way to program an equation may not be the computationally accurate way. Users and developers need to be cognizant of this important point. As an example of this phenomenon, not infrequently a nonlinear solver will fail for one parameterization of an equation, but will produce a solution for another equivalent parameterization.
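The two algorithms can be compared directly in single precision. A minimal sketch with NumPy's float32 type follows; the exact last digits printed depend on the library's rounding, but the contrast is the point.

    import numpy as np

    x = np.float32(9e-8)
    y = np.exp(x)           # rounds to 1 + 2**-23 in single precision

    # Algorithm 1: the "natural" formula
    f1 = np.float32(1.0) if x == 0 else (y - np.float32(1.0)) / x

    # Algorithm 2: the reformulation whose rounding errors largely cancel
    f2 = np.float32(1.0) if y == 1 else (y - np.float32(1.0)) / np.log(y)

    print(f1, f2)   # f1 is wrong in the first decimal place (about 1.3245),
                    # while f2 equals 1 to single-precision accuracy

Numerical libraries supply functions such as expm1 and log1p precisely so that users need not rediscover tricks of this kind.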

Again it is seen that computer arithmetic, when performed properly, may not be at all like paper-and-pencil arithmetic. Neither are mistakes computers make like the ones humans make. With paper and pencil, the order in which several numbers are added makes no difference to the final sum; not so with computers. In our arithmetic, if x − y = 0 then x = y, but not in computer algebra unless the computer supports gradual underflow. Two formulae that are equivalent in our algebra are not necessarily equivalent when programmed into a computer. Such differences as these mean that our everyday notions of arithmetic errors do not coincide with the arithmetic errors made by computers. Even when the arithmetic is good and the bounds on the truncations and rounding errors are tiny, caution still must be observed, for "small" differences can matter appreciably.

2.4 “Small” Differences Matter

Recall the dictum from the days of pencil-and-paper computation: if you want one decimal of accuracy, carry your calculations out to two decimals, i.e., if you want n digits of accuracy carry out the calculations to n + 1 digits. An example that illustrates the falsity of this dictum is due to J. Vandergraft (1983), in which carrying calculations out to four digits produces zero digits of accuracy.


Consider the regression Y = α + βX + ε where

X: 10.00, 10.10, 10.20, 10.30, 10.40, 10.50.
Y: 1.000, 1.200, 1.250, 1.267, 1.268, 1.276.

The normal equations are

6α + 61.5β = 7.261
61.5α + 630.55β = 74.5053

which yields a regression line of Y = −3.478 + 0.457X. Suppose that the normal equations13 had been computed only to four significant digits as

6α + 61.5β = 7.261
61.5α + 630.6β = 74.50.

The resulting regression line would have been Y = −2.651 + 0.377X. Changing the fourth significant digit of the normal equations changed the first significant digit of a coefficient, resulting in no digits of accuracy for either the intercept or the slope. The two regression lines fit quite differently when cast against a scatterplot of X and Y. This example further dispels the errant notion that "arithmetic more precise than the data it operates on is needless." To the contrary, as Longley made clear, carrying calculations out to single or double precision is no guarantee of even a single digit of accuracy, and the choice of the proper algorithm matters, too.

To understand how results can be adversely affected, realize that roundoff error and truncation error, if combined in an unstable method (e.g., repeated division of a large number by a small number) can interact until the combined error overwhelms the result. Consider the system of equations Xb = y where the matrix X and vector y are known and the vector b is to be calculated. From basic econometrics we know that b = (X′X)^−1X′y. If X is changed a little and b changes a little then the data are well-conditioned; if b changes a lot then the data are ill-conditioned. In traditional least-squares, if the data matrix X consists of collinear data, then X might be an ill-conditioned matrix and solution could be problematic. A method of obtaining b that works for well-conditioned data might not work at all for ill-conditioned data. The notion of condition is not limited just to data; entire classes of problems can be said to be ill-conditioned, such as Fredholm equations of the first kind.

As another demonstration that small differences matter, Beaton, Rubin, and Barone (1976) perturbed the independent variables of Longley's data by adding a random number uniformly distributed on [−0.5, 0.499] to the last published digit. For example, 1947 GNP deflator is reported as 83, but it might well have been anywhere between 82.5 and 83.499. Then a regression was run. This procedure was repeated 1000 times. Since the perturbation of the independent variables is within rounding tolerance of the data, one might expect little change in the estimated coefficients. However, the Longley data are ill-conditioned and so a small change in X can produce a large change in b. In the 1000 regressions, the coefficient on the GNP deflator assumed values from –232.3 to 237.0. The ill-conditioned nature of the Longley data is apparent in Table 3, which presents Longley's coefficients and statistics for the coefficients from 1000 perturbed regressions. One might think that the means of the coefficients from these 1000 perturbed regressions would be near Longley's unperturbed solution, but such is not the case. See Vinod (1982) for a discussion of the numerical and statistical issues involved.

13 Of course, no one should use the normal equations to solve for regression coefficients. While the normal equations work algebraically, computationally they are a disaster.


As much as it highlights the need for accurate data, this experiment also underscores the idea that small differences can matter appreciably and the need for accurate algorithms.
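The experiment is straightforward to replicate. The sketch below assumes the regressors are held in an array X whose last published digit is the units digit (as for the GNP deflator) and the dependent variable in y; both the variable names and that simplification are ours, not Beaton, Rubin, and Barone's.

    import numpy as np

    rng = np.random.default_rng(0)

    def perturbed_coefficients(X, y, reps=1000):
        # Re-estimate the regression after perturbing each regressor within
        # rounding tolerance of its last published digit.
        n, k = X.shape
        coefs = np.empty((reps, k + 1))
        for r in range(reps):
            Xp = X + rng.uniform(-0.5, 0.499, size=X.shape)
            Z = np.column_stack([np.ones(n), Xp])
            beta, _res, _rank, _sv = np.linalg.lstsq(Z, y, rcond=None)
            coefs[r] = beta
        return coefs.min(axis=0), coefs.max(axis=0)

With well-conditioned data the minima and maxima stay close to the unperturbed estimates; with data as ill-conditioned as Longley's they can wander as far as Table 3 shows.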

2.5 Algorithms Matter

When solving y = Xb we know that b = (X′X)^−1X′y. This is what the least squares result looks like, but is not how it should be computed. The person unfamiliar with numerical methods should be forgiven for concluding that the way to calculate the least-squares coefficient vector b is to calculate (X′X)^−1 and then post-multiply it by X′y (multiplying (X′X)^−1 by an estimate of the error variance yields cov(b)). However, direct solution of the normal equations is very susceptible to roundoff error and hence extremely undesirable. Vinod (1997, p. 218) criticizes the "estimating function" literature for ignoring the numerical instability associated with direct solution of the normal equations. Numerically, a better way is first to calculate b without inverting X′X, and then use b to calculate (X′X)^−1 and cov(b). Wampler (1980) discusses various methods.
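The difference is easy to exhibit on nearly collinear data. In the sketch below (synthetic data of our own construction, not one of the published benchmarks), solving the normal equations explicitly squares the condition number of X and typically loses roughly twice as many digits as an SVD-based least squares routine.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    x1 = rng.standard_normal(n)
    x2 = x1 + 1e-7 * rng.standard_normal(n)        # nearly a copy of x1
    X = np.column_stack([np.ones(n), x1, x2])
    beta_true = np.array([1.0, 2.0, 3.0])
    y = X @ beta_true                              # noiseless, so beta_true is the exact solution

    b_normal = np.linalg.solve(X.T @ X, X.T @ y)   # textbook formula (X'X)^-1 X'y
    b_svd, *_ = np.linalg.lstsq(X, y, rcond=None)  # SVD-based least squares

    print(np.abs(b_normal - beta_true).max())      # typically accurate to only a few digits
    print(np.abs(b_svd - beta_true).max())         # typically accurate to many more digits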

One such method of finding b without inverting X′X is the LU decomposition (Press et al. 1994, §2.10), which will work on a well-conditioned matrix, but is prone to failure if the data are collinear. For a near-singular matrix, the singular value decomposition (SVD, Press et al. 1994, §2.6) will work, though it is slower (requires more operations) than the LU decomposition. In regression analysis, the SVD is the method of choice (Sven Hammarling 1985; Press et al. 1992). Vinod and Ullah (1981, p. 5) is one of the few econometrics books that recommends the SVD estimate of the regression problem. The SVD should be used until one has proved that a faster algorithm will be adequate or necessary.14 If it cannot be shown that the data will always be well-conditioned, they must be presumed to be ill-conditioned. To assume otherwise lulls the user into a false sense of trust and then, at some random time, he is unknowingly betrayed. Surprisingly, few packages mention what algorithm is used to calculate b and cov(b), and clearly a poor choice of matrix inversion method can lead to disaster, as was made clear by Longley. As a general rule in computing, the method of solution greatly affects the accuracy of the solution, and properly implementing a well-chosen algorithm is important. While describing how different methods of calculating b and cov(b) can lead to different answers is beyond the scope of this paper, the basic ideas can be illuminated by demonstrating how different methods of calculating the variance of a series can lead to different answers.

TABLE 3
LONGLEY RESULTS AND SUMMARY STATISTICS ON 1000 PERTURBED REGRESSIONS

coefficient        β0             β1       β2      β3      β4      β5      β6
Longley            −3482258.63    +15.06   −0.04   −2.02   −1.03   −0.05   +1829.2
Beaton, et al.
  mean             −1152648.37    −26.44   +0.03   −0.96   −0.72   −0.28   +637.1
  minimum          −3483280.69    −232.3   −0.09   −2.42   −1.33   −0.94   −1706.9
  maximum          +3452563.48    +237.0   +0.20   +1.77   +0.36   +0.48   +1800.9

Source: Beaton, Rubin, and Barone (1976).

14 We have the luxury of concerning ourselvessolely with accuracy. As a practical matter, itshould be noted that the SVD is both memory-and time-intensive. A developer cannot be faultedfor implementing the QR decomposition in itsstead. We admit to preferring the QR to the SVDwhen bootstrapping.

648 Journal of Economic Literature, Vol. XXXVII (June 1999)

Page 17: McCullough and Vinod: The Numerical Reliability of Econometric

different methods of calculating thevariance of a series can lead to differentanswers.

Robert Ling (1974) considered five methods for calculating the variance, of which we present three. The least precise he showed to be the "calculator" formula (since it requires fewer keystrokes, it often is suggested as an alternative formula for calculation):

V1 = [Σx² − (1/n)(Σx)²] / (n − 1)    (7)

while the standard formula

V2 = [Σ(x − x̄)²] / (n − 1)    (8)

is more precise because V1 squares the observations themselves rather than their deviations from the mean, and in doing so loses more of the smaller bits than V2. Specifically, V1 is much more prone to cancellation error than V2. Yet a third, less familiar, formula is the "corrected two-pass method":

V3 = (1/(n − 1)) [Σ(xi − x̄)² − (1/n)(Σ(xi − x̄))²]    (9)

which is designed to account for rounding error. Algebraically, the right-most term equals zero, but computationally it will not. Again, pencil-and-paper arithmetic is not like computer arithmetic. If calculation of the mean is exact (with no rounding error) then the second term on the right-hand side equals zero and V3 collapses to V2. In the presence of rounding error, V3 is likely to be more accurate than V2.

Sometimes a developer claims to use one formula when he actually uses another, or does not specify a formula at all. How might the quality of the variance-calculating algorithm be assessed? Calculate the sample variance of three observations: 90000001, 90000002, and 90000003 (obviously the correct answer is 1.0). L. Wilkinson and Gerard Dallal (1977) used this test on a variety of mainframe packages and found that all but one failed to produce the correct answer. Many econometric packages fail this test. A disingenuous counter-argument is that the data can always be rescaled to take better advantage of the floating point representation. The appropriate rejoinder is: "How do you know that you need to rescale in the first place?" Our practical advice is: Test the software.
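The three formulas of the previous paragraphs make the test concrete. The short Python sketch below (ours) applies them to the Wilkinson-Dallal data in ordinary double precision; the two-pass formulas V2 and V3 return 1.0, while the calculator formula V1 generally does not.

# Sample variance of 90000001, 90000002, 90000003 (correct answer: 1.0)
# computed by the three formulas (7), (8), and (9).
def v1(x):                       # "calculator" formula (7)
    n = len(x)
    return (sum(xi * xi for xi in x) - (1.0 / n) * sum(x) ** 2) / (n - 1)

def v2(x):                       # standard two-pass formula (8)
    n = len(x)
    m = sum(x) / n
    return sum((xi - m) ** 2 for xi in x) / (n - 1)

def v3(x):                       # corrected two-pass formula (9)
    n = len(x)
    m = sum(x) / n
    return (sum((xi - m) ** 2 for xi in x)
            - (1.0 / n) * sum(xi - m for xi in x) ** 2) / (n - 1)

x = [90000001.0, 90000002.0, 90000003.0]
print(v1(x), v2(x), v3(x))       # v2 and v3 give 1.0; v1 generally does not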

This, though, is the usual state of affairs in using econometric software: performing the usual operations in an econometric package will typically give no indication that anything is wrong unless the package produces a glaring error such as a negative variance or R² > 1. For example, when computing correlation matrices, users of package "X2" or "X3," absent prior benchmarking, would have no inkling that their program was faulty; a user of package "X4," since the correlations were greater than unity, might suspect something. Interestingly, each of the four packages passes the Longley Benchmark—that a program does one thing correctly is no assurance that it does another thing correctly.

The above discussion may convey the impression that computer arithmetic is hopelessly imprecise, but nothing could be farther from the truth. Indeed, rules for computer arithmetic, if rigorously implemented, can lead to theorems and proofs of accuracy. The problem is not that the computer is inaccurate; it is how the inaccuracy is handled. If the inaccuracy is handled properly, then theorems can be proved which define an arithmetic for floating point computations and so assure the final accuracy of an answer. See David Goldberg (1991) for an extended layman's discussion and examples of such theorems. Additional useful references for computer arithmetic are Kennedy and Gentle (1980, ch. 3) and Ronald Thisted (1988, ch. 2). More generally, for the economist interested in matters computational, we highly recommend Kenneth Judd's (1998) recent textbook Numerical Methods in Economics.

3. General Benchmarks

Given the history of benchmarking statistical software, simply trusting the software represents the triumph of hope over experience and is an invitation to disaster. This, then, is the entire purpose of benchmarking: with given inputs and correct outputs in hand, the program is supplied with the input and its output is checked against the correct answer. Exact accuracy is not demanded, but the program's answer should be close enough to the correct answer. This develops a knowledge of where the program errs and whether the error is likely to affect results. First examine the program's general procedures. After determining that they are reliable, then examine specific econometric procedures based on the general procedures. For example, first determine that the linear least squares routine "works." Then check that a specialized procedure based on linear least squares, e.g., calculation of autocorrelation coefficients, is implemented properly.

Following Sawitzki (1994), we recommend the use of the Wilkinson Tests for entry-level purposes. Beyond that, in Section One we mentioned several benchmarks for least squares procedures, such as univariate summary statistics, analysis of variance, and linear regression. An obvious gap in the literature has been benchmarks for nonlinear least squares procedures. This deficiency was remedied by the Statistical Engineering Division of the Information Technology Laboratory at the National Institute of Standards and Technology (NIST), which recently released its "Statistical Reference Datasets" (StRD), a collection of benchmarks for statistical software.15 While NIST has plans to expand the number of benchmarks, at this writing it contains 58 benchmarks in four suites: univariate summary statistics (9 benchmarks), analysis of variance (11), linear regression (11), and nonlinear regression (27).

To circumvent rounding error problems, NIST used multiple precision calculations, carrying 500 digits for linear procedures and using quadruple precision for nonlinear procedures. Complete computational details and a discussion of test problem selection are in Janet Rogers et al. (1998). The results, rounded to fifteen digits for linear procedures and eleven digits for nonlinear procedures, are referred to as "certified values." Certified values are provided for a number of statistics for each suite. For example, for univariate summary statistics the mean, standard deviation, and first-order autocorrelation coefficient are given to fifteen places. While the output from so many tests is voluminous, McCullough (1998) described how to condense the output, and applied the benchmarks to three popular statistical packages (McCullough 1999). While the Longley lesson has been learned—all packages did well on the linear regression benchmarks—gross errors were uncovered in analysis of variance routines (including negative sums of squares), and some programs produced solutions to nonlinear problems that had zero digits of accuracy. One of the statistical packages solved all 27 of the StRD nonlinear benchmarks, and another solved 26 of 27, returning one incorrect answer. Of course, there are two ways a nonlinear procedure can fail, as described by W. Murray (1972, p. 107): "The first is miserable failure which is discovered when an exasperated computer finally prints out a message of defeat. The second is disastrous failure when the computer and trusting user mistakenly think they have found the answer."

15 On the web at http://www.nist.gov/itl/div898/strd

McCullough (1999a) also applied the StRD to econometric software. Again, the Longley lesson has been learned, but the problems with nonlinear estimation appear to be more severe with econometric software than with statistical software. At one extreme, one econometric package correctly solved 26 of the 27 nonlinear estimation problems, once producing a miserable solution (of course, no solution is better than an incorrect solution). At the other extreme, a popular econometric package correctly solved nine, failed miserably four times, and for the remaining fourteen produced disastrous solutions. Absent a benchmark, a user would have no idea that this package produced completely inaccurate "solutions" for over 50 percent of the test problems. A third package produced eight disastrous solutions. Thus, the results of any study involving nonlinear estimation that used the latter two packages must be called into question. It is not too far off the mark to suggest, at least for econometric packages, that nonlinear estimation is today where linear estimation was thirty years ago. As an aside, we strongly caution any economist who uses a spreadsheet package for econometric estimation to benchmark the package first (McCullough and Wilson 1999).
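The condensation used in this literature is a count of correct significant digits; a common form is the log relative error, LRE = −log10(|q − c| / |c|), where q is the package's answer and c the certified value. The sketch below is our minimal version of such a scoring function, with made-up numbers in the example; it is not NIST code.

# Log relative error: an approximate count of correct significant digits in a
# computed value q relative to a certified value c.
import math

def lre(q, c, max_digits=15):
    if q == c:
        return float(max_digits)
    err = abs(q) if c == 0.0 else abs(q - c) / abs(c)   # fall back to absolute error at c = 0
    return min(max_digits, max(0.0, -math.log10(err)))

# Hypothetical example: a package reports 1829.15 where the certified value is 1829.17.
print(round(lre(1829.15, 1829.17), 1))                  # roughly 5 digits of accuracy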

Also related to nonlinear estimation, we note that some packages report no details of their nonlinear estimation (not even the method), and others are vague about important details such as whether numerical or analytic derivatives are used.16 Still others make no mention of how the standard errors are computed (e.g., via the gradient or the inverse of the Hessian), despite the fact that this can affect inference. Some packages conflate the minimization and nonlinear least squares routines, offering only a single procedure. The basis for this is that any minimization problem can be written as a maximization problem merely by appending a minus sign to the objective function. Typically, though, separate routines are used for minimization of a sum of squares and maximization of likelihood functions. The primary reason is that the objective function of a least squares problem has a special form, and more efficient, specialized algorithms have been devised for that form. Some packages offer only one solver, though neither modified Gauss-Newton nor Levenberg-Marquardt is sufficiently dependable to serve as the sole general nonlinear least squares method in supporting software systems (Kennedy and Gentle 1980, p. 484).

All these problems are compounded by the fact that virtually no journal requires authors to reveal computational details, not even the software used. Thus, even if we know that some software package is defective, we have no idea which published results are based on defective software and might well be wrong. Nor, more to the point, do we have any idea which results might be right. Even in the rare instances when a software package is identified in an article, and the package is later discovered to be defective in a way which affects the article's results, updating the results with a reliable software package is problematic. The reason is that virtually no journals require authors to archive either their data or their code, and this constitutes an almost insurmountable barrier to replication in economic science. Scientific content is not dependent merely on writing up a summary of results. Just as important is showing the precise method by which the results were obtained and making this method available for public scrutiny. To our knowledge, only the journal Macroeconomic Dynamics (MD) requires both data and code, while the Journal of Applied Econometrics (JAE) requires data and encourages code, and the Journal of Business and Economic Statistics (JBES) and The Economic Journal require data; all four journals have archives which can be accessed via the worldwide web. In the context of replicability and the advancement of science, the advantage of requiring code in addition to data is obvious. While it may be trivial to use the archived code to replicate the results in a published article, only if the code is available for inspection will other researchers have the opportunity to find errors in the code. Just as commercial software needs to be checked, so does the code which underlies published results.

16 Generally speaking, analytic derivatives are more accurate than numerical derivatives. Numerical derivatives crafted for a particular problem can be as accurate as analytic derivatives, but in a general purpose nonlinear solver, analytic derivatives are preferred (Jonathan Bard 1974, p. 117; J. E. Dennis and Robert Schnabel 1996, p. 106). See Janet Donaldson and Schnabel (1987) for Monte Carlo evidence.

The StRD, while the foremost collection of benchmarks for econometric purposes, is not the only one. We note also the existence of suites of benchmarks for matrices (including inversion, multiplication, eigenvalue decomposition, etc.). Given the prevalence of specialized covariance matrices whose calculation is not automated, and also the increasing use of eigenvalue analysis for dynamical systems, these could profitably be applied to testing the matrix-handling facilities of econometric packages. They also would be suitable for testing matrix-based languages such as GAUSS and Ox. The "Matrix Market" of R. Boisvert et al. (1997) has over 500 matrices, each with its own web page, though many of them are quite specialized and rarely encountered in economics. A more manageable number, about 70, with more relevance to economics is Higham's (1991, 1995) "TESTMAT" collection. We are unaware of any econometric software review that assesses the accuracy of matrix calculations. We have applied some of Higham's test matrices to some econometric packages and found errors in inversion and eigenvalue routines, which suggests the need for such assessments.
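Even without a formal suite, a classic ill-conditioned test matrix gives a quick check of an inversion routine. The sketch below is ours (it is not one of Higham's test programs); it uses scipy, which supplies both the Hilbert matrix and its known closed-form inverse.

# Accuracy check of matrix inversion on the Hilbert matrix, whose inverse is
# known in closed form.
import numpy as np
from scipy.linalg import hilbert, invhilbert

for n in (6, 10, 12):
    H = hilbert(n)
    approx = np.linalg.inv(H)
    exact = invhilbert(n)
    diff = np.max(np.abs(approx - exact)) / np.max(np.abs(exact))
    print(f"n = {n:2d}  cond(H) ~ {np.linalg.cond(H):.1e}  "
          f"largest error relative to largest entry ~ {diff:.1e}")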

Checking the basic estimation machinery is only the beginning. That the linear or nonlinear solver "works" is no assurance that it has been properly implemented in specialized procedures such as FIML and GARCH. Such specialized procedures need benchmarks, of which there is a marked dearth.

4. The Need for Specific Benchmarks

Discrepancies abound between different software packages, and many of us are familiar with getting two answers to the same problem using two different packages. Here we do not refer to minor differences due to rounding error, hardware configuration, or operating system, but major differences due to some unknown reason, such as the discrepancy between FIML results presented in our Table 1. All three sets of answers cannot be right; and in fact they are not: the correct answer is by Calzolari and Panattoni (1988), who provided a benchmark for FIML, including the asymptotic covariance matrix thereof computed by various methods. See Julian Silk (1996) for a discussion.

An unfortunate consequence of two different packages providing two different answers to the same problem is a lack of replicability in economic research. In their important article, William Dewald, Jerry Thursby, and Richard Anderson (1986) showed that replicating economic research is a nearly impossible task. Part of the replicability problem is that, in addition to the data and the code used to run a commercial package, an aspiring replicator also needs the same commercial package as the original author. Yet, a researcher should be able to expect that FIML estimates of Klein's Model I are not dependent on the package used. Clearly, benchmarks, replication, and software reliability are fundamentally related.

The advantage of writing software to meet existing benchmarks is obvious. Yet even when benchmarks exist, few econometric software developers provide users with benchmark results, or even provide sufficient information about the algorithm that a user can ascertain whether or not the procedure is likely to produce correct results. For example, many packages have "autocorrelation" and "partial autocorrelation" procedures which, given an input series, produce the lagged autocorrelations and lagged partial autocorrelations. These are critical for choosing the number of AR and MA lags in Box-Jenkins modelling. Yet rarely is the user told how these results are calculated, and this is important information, since some poor algorithms are in widespread use. An example follows.

Let x0 be a series and let x1 and x2 be its first and second lags. Let ρ1 be the first-order autocorrelation coefficient (the simple correlation between x0 and x1) and let π2 be the second-order partial autocorrelation coefficient (the correlation between x0 and x2 after the effect of x1 on x0 is removed). By definition the first-order partial autocorrelation coefficient is equal to the first-order autocorrelation coefficient, i.e., π1 ≡ ρ1. There are many ways to calculate the autocorrelation and partial autocorrelation coefficients, and some are better than others. With respect to calculation of partial autocorrelation coefficients, George Box and Gwilym Jenkins (1976, p. 65) note that the regression method is more accurate than methods based on the Yule-Walker equations (YWE). M. Priestley (1981, pp. 350–52) lists four methods in decreasing order of accuracy: exact maximum likelihood, conditional maximum likelihood (CML), approximate least squares, and the YWE. In general, the YWE are to be eschewed for economic data (Dag Tjostheim and Jostein Paulsen 1983), yet they are frequently employed in econometric software and often the only method mentioned in time series and econometrics texts. It is interesting to compare the YWE to CML.

Define a series x0 = {1, 2, . . . , 10}. It is linear without errors, so ρ1 ≡ 1. Once the effect of x1 is removed from x0, there is nothing left to explain, so π2 ≡ 0. These intuitive results can be verified by calculation from first principles, e.g., elementary formulae for correlation and partial correlation (Alan Stuart and J. Ord, 1991, p. 1012). A standard implementation of the YWE yields ρ1 = 0.70 and π2 = −0.153, while CML yields ρ1 = 1.0 and π2 = 0. The effect of inaccurate estimation of partial autocorrelation coefficients on ARMA model identification needs no elaboration. Accurate computation of partial autocorrelation coefficients is discussed in McCullough (1998a).
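These figures are easily reproduced. The sketch below (ours) computes the Yule-Walker autocorrelations for the series 1, 2, . . . , 10 and obtains π2 from the Durbin-Levinson relation π2 = (r2 − r1²)/(1 − r1²); it also computes the ordinary correlation of the series with its first lag.

# Reproducing the Yule-Walker figures quoted above for x0 = 1, 2, ..., 10.
import numpy as np

x = np.arange(1.0, 11.0)
n = len(x)
d = x - x.mean()

c0 = np.dot(d, d) / n                    # Yule-Walker autocovariances divide by n
r1 = (np.dot(d[1:], d[:-1]) / n) / c0
r2 = (np.dot(d[2:], d[:-2]) / n) / c0
pi2_yw = (r2 - r1 ** 2) / (1.0 - r1 ** 2)
print("Yule-Walker: rho1 = %.2f  pi2 = %.3f" % (r1, pi2_yw))   # 0.70 and -0.153

# The series is exactly linear, so its correlation with its first lag is 1, and
# with nothing left to explain after removing x1, pi2 is taken to be 0.
print("simple correlation of x0 with its first lag: %.2f"
      % np.corrcoef(x[1:], x[:-1])[0, 1])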


As another example of the need for benchmarking specialized procedures, consider the staggering number of published papers using GARCH. It is well-known that different packages produce different answers to the same problem. While until recently there was no benchmark for GARCH, this problem was all but solved by Gabriele Fiorentini, Calzolari, and Panattoni (1996, hereafter FCP). They provided complete and usable closed-form expressions for the gradient and Hessian of a univariate GARCH model (recall that analytic derivatives are more accurate than the numerical derivatives commonly used in GARCH procedures). They also provided FORTRAN code for estimating such a model. Using this code on a well-known GARCH dataset constitutes a benchmark.

The JBES archive has the T = 1974 observations on the daily percentage nominal returns for the Deutschemark/British pound exchange rate from Tim Bollerslev and Eric Ghysels (1996, hereafter BG). The JAE archive has the FCP FORTRAN program. McCullough and Renfro (1999) used these data and code to produce the FCP GARCH benchmark and applied it to seven packages, "Y1" through "Y7." The model is

yt = μ + εt,   εt | Ψt−1 ~ N(0, ht),   ht = α0 + α1ε²t−1 + β1ht−1.

(BG estimated a different parameterization of this model; their results are consistent with the benchmark.) Since this model must be estimated by nonlinear methods, starting values for the coefficients and a method for initializing the series ht and ε²t at time t = 0 must be specified. FCP, BG, and many others have used the initialization h0 = ε²0 = SSR/T (SSR is the sum of squared residuals).
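To fix ideas, the sketch below shows the conditional-variance recursion and Gaussian likelihood that such a benchmark exercises. It is our illustration, not the FCP code: it uses the h0 = ε²0 = SSR/T initialization described above, but it relies on a generic optimizer with numerical derivatives, which is exactly the kind of shortcut the benchmark is designed to expose.

# Sketch of GARCH(1,1) estimation for y_t = mu + e_t with
# h_t = a0 + a1*e_{t-1}^2 + b1*h_{t-1} and h_0 = e_0^2 = SSR/T.
# A benchmark-quality implementation would use the analytic derivatives of
# Fiorentini, Calzolari, and Panattoni (1996).
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, y):
    mu, a0, a1, b1 = params
    e = y - mu
    h_prev = e2_prev = np.sum(e ** 2) / len(y)     # SSR/T initialization
    ll = 0.0
    for t in range(len(y)):
        h = a0 + a1 * e2_prev + b1 * h_prev
        ll -= 0.5 * (np.log(2 * np.pi) + np.log(h) + e[t] ** 2 / h)
        h_prev, e2_prev = h, e[t] ** 2
    return -ll

def fit_garch11(y, start=(0.0, 0.1, 0.1, 0.8)):
    bounds = [(None, None), (1e-8, None), (0.0, 1.0), (0.0, 1.0)]
    return minimize(neg_loglik, np.array(start), args=(y,),
                    method="L-BFGS-B", bounds=bounds)

# Usage, with simulated returns standing in for the BG exchange rate data:
y = np.random.default_rng(0).standard_normal(500) * 0.5
print(fit_garch11(y).x)                            # estimates of mu, a0, a1, b1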

Surprisingly, not all packages could even begin to estimate this simple GARCH model, which has been much used in applied work. Package Y2 allows the user to control neither the starting values nor the initialization of h0 and ε²0. Packages Y1, Y4, and Y6 allow the user to specify the starting values, but offer no control over the initialization. Packages Y3 and Y5, which use analytic rather than numerical first derivatives, achieved two or three digits of accuracy for the coefficients (the accuracy of the standard error estimates is another matter altogether). Whether this degree of accuracy is suitable as input for an option pricing model is unknown. Only one package, Y7, hit the benchmark to several digits of accuracy, and did so for both coefficients and standard errors. This package uses analytic first and second derivatives.

In all but two cases, it was necessary to contact the developer to determine at least one and sometimes all of the following: the precise method of initializing h0 and ε²0; the type of derivatives used; and the method of calculating standard errors (there are at least five ways). That these matters are not explicitly addressed in the documentation is a serious omission. That four of seven GARCH procedures cannot accommodate a simple and popular GARCH model suggests that some vendors take an idiosyncratic approach to programming. For these four packages, the user has very little choice as to the conditional likelihood to be maximized. Such approaches to documentation and programming are an impediment to good research.

Some procedures are easy to benchmark, such as (partial) autocorrelation coefficients and two-stage least squares. These, however, are the exception. Generally, devising a benchmark is an arduous process, as the work of Calzolari et al. and Fiorentini et al. attests. J. Dongarra and G. Stewart (1984) noted, while describing their pioneering efforts to test only linear algebra routines, "In some cases, the test programs were harder to design than the programs they tested." The general state of affairs is that there are no benchmarks for most specialized econometric procedures. Perusing the list of commands for any econometric software package yields many procedures for which we were unable to find a benchmark and for which we found discrepancies between packages: linear estimation with AR(1) errors, estimation of an ARMA model, Kalman filtering, limited dependent variables models, SUR estimation, three-stage least squares, and so on. Absent benchmarks, we cannot be sure that an econometric package gives us reliable answers. Estimation, however, is not the only part of a package whose reliability needs to be verified.

5. Testing the Random Number Generator

Only a few years ago, the use of random numbers in economics was the arcane province of a few specialists. With the recent explosion in computing power and concomitant theoretical advances, this is no longer the case. Simulation (Christian Gourieroux and Alain Montfort 1996), bootstrapping (Vinod 1993), and Bayesian econometrics are but three econometric methods which make extensive use of the random number generator (RNG). The recent text by Gentle (1998a) is a useful reference. Good introductions to RNGs are Stephen Park and Keith Miller (1988) and Press et al. (1992, ch. 7), both of which stress that an RNG should not be treated by the user as a "black box"—the user who chooses to remain uninformed of the properties of his RNG does so at his own peril. Yet, we examined several econometric packages and found that most gave no information whatsoever about the RNG employed, not even a citation to the article in which the RNG was published. Econometric software developers typically do present the RNG as a black box.

As is well known, the random numbers produced by a computer are not random but pseudo-random, i.e., they are produced deterministically, but (if the RNG works well) appear to be random. One popular RNG is the linear congruential generator (LCG), which for a large integer m produces a sequence of numbers I1, I2, I3, . . . between 0 and m − 1 via the recursion

Ij+1 = aIj + c (mod m)    (10)

based on an initial "seed" I0, which often is supplied by the user. Eventually, the sequence repeats itself, with a period no greater than m, and exactly equal to m if the parameters a, c, and m are carefully chosen. Typically, output is restricted to the interval (0,1) by returning Ij+1/m. This sequence should be uniformly distributed. For the LCG, though not for all RNGs, the same sequence will be produced with the same seed. This reproducibility is a desirable feature for debugging purposes. When debugging a program, it is necessary to examine the input which caused the bug, and if the RNG is not reproducible, the input no longer is available.
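As an illustration, the "minimal standard" generator of Park and Miller (1988) is an LCG with a = 16807, c = 0, and m = 2³¹ − 1; a few lines of code implement the recursion and exhibit its reproducibility from a seed.

# The Park-Miller "minimal standard" LCG: a = 16807, c = 0, m = 2**31 - 1.
# Shown only to illustrate the recursion; its period of m - 1 (about 2.1 billion)
# is far too short for large-scale modern simulation.
def lcg(seed, a=16807, c=0, m=2**31 - 1):
    state = seed
    while True:
        state = (a * state + c) % m
        yield state / m                  # uniform draw on (0, 1)

gen = lcg(seed=12345)
print([round(next(gen), 6) for _ in range(5)])

gen = lcg(seed=12345)                    # same seed, same sequence: reproducibility
print([round(next(gen), 6) for _ in range(5)])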

Ripley (1990) lists the desirable characteristics of an RNG. In short, an RNG should:

1. be reproducible from a simply specified starting point;
2. have a very long period;
3. produce numbers that are a very good approximation to a uniform distribution;
4. produce numbers that are very close to independent in a moderate number of dimensions.

We already addressed the first point; we address the remaining three in turn.


Poor choices for the parameters of any RNG can drastically impair its quality, causing the period to be unnecessarily short or even producing non-uniform random numbers. For example, the infamous RANDU (IBM 1968, p. 77) generator distributed with the IBM 360 mainframe computer had such poor choices for the parameters that Donald Knuth (1997, p. 188) called it "really horrible." RANDU failed even simple tests of randomness. Complicating matters for users, RANDU was widely imitated, and even recommended in textbooks long after its faults were well-known (Park and Miller 1988, p. 1198). As another example, Sawitzki (1985) describes the RNG for IBM PC BASIC as being not only decidedly non-uniform, but having a period of only 2¹⁶. "Short" is a relative term, and periods which were of acceptable length only a few years ago might now be unacceptable, given the recent surge in the demand for random numbers. Knuth (1997, p. 195) suggests that the period should be at least one thousand times larger than the number of values used, i.e., if p is the period and n is the number of calls to the RNG, then p > 1000n. The reason is that the discrepancy between an RNG's output over its entire period and true randomness can be large, especially for linear-type generators, so at most a fraction of the RNG's output should be used. Therefore, a p ≈ 2³¹ generator should be used for no more than 2.1 million calls. Yet a modest double bootstrap (see McCullough and Vinod 1998) with 1999 first stage and 250 second stage resamples requires that the residual vector be resampled half a million times, so that such an RNG can support a sample size of no more than four observations.

The situation is even more stark for more computationally intensive applications such as Bayesian inference with numerical integration, Monte Carlo studies, and calculation of non-standard test statistics. John Geweke and Michael Keane (1997) used one billion random numbers to conduct Bayesian inference via Gibbs sampling. The Monte Carlo of the double bootstrap by David Letson and McCullough (1998) required 45 billion calls to the RNG. MacKinnon (1996) used more than 100 billion to tabulate the distribution functions of unit root and cointegration statistics. With such considerations in mind, Geweke (1996) showed how to set up an RNG with p = 2¹⁰⁰. One expert on random numbers has written (Pierre L'Ecuyer 1992, p. 306) "No generator should be used for any serious purpose if its period (or, at least, a lower bound on it) is unknown." A researcher needs to know about the RNG in his econometric package. Simply having a citation for the RNG is insufficient, because faulty RNGs are still proposed in journal articles (L'Ecuyer 1994), and some RNGs with bad properties are in widespread use.

Not only should an RNG have a long period, it should also pass statistical tests for randomness, because correlated output can wreak havoc. For example, first-order autocorrelation of the random numbers might not be a problem if Monte Carlo methods are used to evaluate a one-dimensional integral, but almost certainly would be disastrous for evaluating a two-dimensional integral. The idea behind testing is to see whether the RNG's output is significantly different from the behavior of a truly independently and identically distributed sequence of uniform random variables. Strictly speaking, the null hypothesis that the output from an RNG is uniform i.i.d. is false, and no single best test exists. Yet, since what is wanted from an RNG is the appearance of randomness, ceteris paribus, we recommend an RNG that passes a given test for uniformity to one that fails it. Moreover, one test is insufficient. Since there are many possible departures from randomness, many tests should be applied.

The following simple example describes testing a sequence of numbers for randomness, which often involves two-level testing. Divide the unit interval into 30 equal bins. Make three hundred calls to the RNG and place each in its appropriate bin. If the RNG does provide truly uniform numbers, there should be 10 numbers in each bin, plus or minus random deviations. This can easily be tested by using the χ² comparison of the actual and expected number in each bin. Repeat this test 10 times, yielding 10 χ² statistics which can then be compared to a theoretical χ² distribution with 29 degrees of freedom using the Kolmogorov-Smirnov (K-S) test. This will test whether the RNG produces numbers that are approximately uniformly distributed. Of course, the number of bins can be increased, as well as the number of calls to the RNG, to provide finer discrimination. The reason for running 10 χ² tests and then using the K-S test (two levels) instead of conducting one large χ² test on 3000 random numbers (one level) is that this increases the power of the test (L'Ecuyer 1994, sec. 4.5.1), which is desirable since the null hypothesis is false.
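The two-level scheme takes only a few lines. The sketch below (ours) applies it to numpy's default generator; the RNG under test would replace the call to rng.random.

# Two-level uniformity test: 10 chi-square statistics (30 bins, 300 draws each)
# compared to the chi-square(29) distribution with a Kolmogorov-Smirnov test.
import numpy as np
from scipy import stats

def chi2_stat(u, bins=30):
    observed, _ = np.histogram(u, bins=bins, range=(0.0, 1.0))
    expected = len(u) / bins
    return np.sum((observed - expected) ** 2 / expected)

rng = np.random.default_rng(123)         # substitute the RNG under test here
first_level = [chi2_stat(rng.random(300)) for _ in range(10)]
ks_stat, p_value = stats.kstest(first_level, "chi2", args=(29,))
print("K-S statistic %.3f, p-value %.3f" % (ks_stat, p_value))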

The above procedure tests how well the RNG can fill the unit interval, but can it fill a square? To test this, make a square with 30 equal intervals on each side, for a total of 900 bins. Draw 9000 pairs of random numbers, which should put 10 in each bin. Uniformity can be tested by comparing actual and expected numbers for each bin with a χ²(899) distribution. Then, the test can be repeated 10 times and those 10 χ² statistics subjected to a K-S test. Many simple RNGs will pass the extension to three dimensions; RANDU will not. While RANDU's output appeared random in one and two dimensions, its correlation showed up strongly in three dimensions, i.e., RANDU's output was not uncorrelated in a moderate number of dimensions. Extensions to still higher dimensions are easily generalized. R. Coveyou and R. MacPherson (1967) first observed, and George Marsaglia (1968) explicitly discussed, that LCGs have a lattice structure. That is, in 3-space, for example, the points (z1, z2, z3), (z2, z3, z4), and (z3, z4, z5) all fall on a finite number of planes. Therefore, the number of planes should be large and they should be close together. Hence, it is important to test for correlation in higher dimensions, though for practical reasons the number of dimensions is limited to about eight.
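RANDU's lattice can be exhibited without any plotting. Because its multiplier is 65539 = 2¹⁶ + 3, every triple of successive values satisfies I(k+2) = 6I(k+1) − 9I(k) (mod 2³¹), so the scaled triples lie on at most fifteen parallel planes in the unit cube; the sketch below (ours) verifies the identity directly.

# RANDU (a = 65539, c = 0, m = 2**31) and the exact linear relation among
# successive triples that confines its points to a few parallel planes.
def randu(seed, n):
    out, state = [], seed
    for _ in range(n):
        state = (65539 * state) % 2**31
        out.append(state)
    return out

seq = randu(seed=1, n=1000)
print(all((6 * seq[k + 1] - 9 * seq[k] - seq[k + 2]) % 2**31 == 0
          for k in range(len(seq) - 2)))      # True: the relation holds exactly
# Equivalently, with u = I / 2**31, the quantity 9*u[k] - 6*u[k+1] + u[k+2] is
# always an integer, so all triples fall on a small number of parallel planes.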

Batteries of such tests are offered by Michael Stephens (1986) and Shu Tezuka (1995), though the first standard battery of tests of which we are aware was given in the 1981 edition of Knuth (1997), with FORTRAN and C implementations by E. Dudewicz and T. Ralley (1981) and Jerry Dwyer and K. Williams (1996), respectively. Marsaglia (1985) noted that Knuth's tests were not very stringent. This is especially true today, as the scale of computer simulations has increased concomitantly with computing power. In particular, some RNGs that pass the Knuth tests frequently produce significant biases when used in large-scale simulations. To remedy this deficiency, Marsaglia (1996) produced "Diehard: A Battery of Tests of Randomness," also known as "Marsaglia's Diehard Tests," for which automated executable files for DOS and LINUX are available, as well as source files in C. The executable operates on a file of about three million random numbers created by an RNG. While three million random numbers will accommodate only the smallest and simplest of Monte Carlo studies, as a first assessment of an RNG, DIEHARD has ease of use to recommend it.17 For testing more than three million, the source files must be modified and recompiled. Names and descriptions of the tests are produced as part of the output. Marsaglia is planning a revision of DIEHARD. L'Ecuyer's TESTU01 program is in the testing stage, and is scheduled to be released in the next year or so.
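Producing DIEHARD's input is itself a short exercise once the generator is in hand. The sketch below (ours) writes roughly three million 32-bit integers from the generator under test to a raw binary file; the exact amount, byte order, and any header conventions expected should be checked against the DIEHARD documentation and McCullough (1998, 1999).

# Write about three million unsigned 32-bit integers to a raw binary file for
# DIEHARD (check its documentation for the exact format expected; this is a sketch).
import numpy as np

rng = np.random.default_rng(2024)             # replace with the RNG under test
draws = rng.integers(0, 2**32, size=3_000_000, dtype=np.uint32)
draws.tofile("rng_under_test.bin")
print("wrote", draws.nbytes, "bytes")         # about 12 MB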

The developer of an RNG often will subject the RNG to several tests and note this in the article in which the RNG is published. However, implementation of an RNG can be difficult, so it is imperative that the developer's implementation be tested. Our own informal application of DIEHARD to a few econometric packages found that while some passed, some did not. Still, even those that passed would be unsuitable for large-scale applications. Importantly, not one package mentioned the period of its RNG.

6. Statistical Distributions

Some persons might think that computer programs are more accurate than statistical tables, since the statistical tables only report a few decimals and computer programs report several, but such is not necessarily the case. Imagine an economist conducting a Chow test with 3 restrictions on 126 observations. Suppose he calculated a test statistic of 4.0. He might well have an interest in the F(3,120) distribution. If he uses Package "X4" to calculate the 1 percent critical value, he obtains 4.12. Thus, he does not reject the null—until a referee points out that a standard statistical table gives a critical value of 3.95, and so the null is rejected along with the article. As another example, the documentation for Package "X5" says that its Student's-t function returns "the probability that a t-statistic with d degrees of freedom exceeds X." Letting X = 1.345 and choosing d = 14, a statistical table shows that the upper tail is 0.10. Yet Package "X5" returns 0.2000—the value for a two-tail test. Erroneous inference awaits the user who trusts such documentation. Leo Knüsel (1995) has documented the inaccuracy of statistical distributions in GAUSS v3.2.6, and also that these same inaccuracies were not corrected in release v3.2.13 (Knüsel 1996). Even more serious inaccuracies were revealed in the statistical distributions of Excel97 (Knüsel 1998). McCullough (1999b) used Knüsel's (1989) ELV program to document similar inaccuracies in econometric packages.

Statistical distributions have two fundamental applications: calculating p-values and calculating critical values for some level of significance. Let F(x) be the cumulative distribution function (cdf) of the random variable X whose probability density function is f(x). The first step in calculating a (one-sided) p-value for a calculated statistic, x̃, is to determine p̃ = P(X ≤ x̃) = F(x̃). Usually, there is no closed-form expression available, and so the problem becomes one of approximating an integral

F(x̃) = ∫_{−∞}^{x̃} f(t) dt    (11)

The problem of finding a critical value amounts to approximating the inverse of the above integral, i.e., finding xc such that

xc = F⁻¹(1 − p)    (12)

for some specified value of p.
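For the common distributions, a well-tested library makes it straightforward to check a package's output against (11) and (12). The sketch below uses scipy and reproduces the F(3,120) and t(14) figures cited above.

# Tail areas (11) and critical values (12) from a well-tested library.
from scipy import stats

print(round(stats.f.ppf(0.99, 3, 120), 2))    # 1 percent critical value of F(3,120): 3.95
print(round(stats.t.sf(1.345, 14), 3))        # upper tail of t(14) at 1.345: 0.10

p = 0.01
x_c = stats.f.ppf(1 - p, 3, 120)              # critical value, as in (12)
print(round(stats.f.sf(x_c, 3, 120), 6))      # recovers p, as in (11)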

17 The documentation for DIEHARD is sketchy. Persons wishing to use DIEHARD should consult McCullough (1998, 1999) for implementation details.


It is well-known that the cdf of the standard normal distribution must be evaluated numerically. Numerical evaluation, or integral approximation, is rife with technical difficulties. Improper integrals (with limits of −∞ or +∞) must be dealt with, and integrals with vertical asymptotes and other such stumbling blocks must be overcome. General expositions on these details are available in Kennedy and Gentle (1980, ch. 5) and Thisted (1988, ch. 5), with discussions of specific distributions in Norman Johnson, Samuel Kotz, and N. Balakrishnan (1994, 1995). Algorithmic error is especially critical here. Barry Brown and Lawrence Levy (1994) examined several algorithms for the incomplete beta distribution (which is the basis of the F-distribution, among others) and found only one of them to be reliable. B. Bablok (1988) uncovered several errors in commonly used formulae for non-central statistics. As an example, a procedure for the non-central F based on Eq. 26.6.18 of Milton Abramowitz and Irene Stegun (1972) will return incorrect results.

It is sometimes suggested that statistical distributions need be accurate only to two or three digits. There are two objections to this. First, a package that at best gets two or three digits is more likely to provide zero digits than a package that at best gets several digits. Second, there are many applications for which two or three digits are insufficient. Among these are Edgeworth expansions, Monte Carlo Markov chains, size and power calculations, quantile-quantile plots, censored and truncated regressions, and many more. These methods sometimes require accurate evaluation of probabilities as small as 1.E-12 and smaller. Knüsel (1995) suggested that the minimum requirement for statistical distributions is that they should be accurate to all displayed digits, and observed that this has three implications:

• If a probability is smaller than 0.5E-4 and the program prints out 0.0000, then the program is correct.

• If a probability is zero and the program prints 0.76543E-11, then the program is incorrect.

• If a probability is 3.456E-10 and the program prints 3.401E-10, this is incorrect, for the result is not correct as printed.

This last point is easiest to see when it is remembered that relative error, and not absolute error, is the relevant criterion. While 3.401E-10 − 3.456E-10 = −0.55E-11 is a small number, the relative error is |3.401E-10 − 3.456E-10| / 3.456E-10 ≈ 1.6 percent, which is not small: at that level of relative error, the result is not correct to the four digits displayed.

In order to assess accuracy, accurate results must be obtainable. Here, Knüsel's (1989) ELV, available as a DOS executable, and Brown's (1997) DCDFLIB, available in FORTRAN77 and C tar files, can be of use. A basic sequence of percentiles (BSP) {0.0001, 0.001, 0.01, 0.1, 0.2, . . . , 0.9, 0.99, 0.999, 0.9999} can be profitably employed. Probabilities outside this range are referred to as the "extreme tails." Use ELV or DCDFLIB to generate critical values for the BSP. Feed these critical values into the econometric package to see whether the correct tail values are returned. If they are, then the distribution seems accurate and the extent of its accuracy can be assessed by using ELV or DCDFLIB to answer the following questions:

• How far can the degrees of freedom be extended before the program breaks?

• Are the results still sensible for extreme parameters, or is nonsense output, such as negative probabilities, produced?


• Does the program correctly calculate very small probabilities? (This amounts to checking the "extreme tails" to determine the point at which the package no longer produces accurate answers.)

• Does the program give very small results such as 0.76345E-37 that are completely wrong?

To test inverse functions, feed the BSP into the inverse function and see whether the exact critical values are returned. A common problem which can be observed is that one tail is more accurate than the other. This is because many packages do not compute upper and lower tails separately, but compute only one tail and find the other by complementation. When a probability is near zero or unity, this can lead to cancellation error if upper and lower tails are not computed separately. Consider determining p = P(X > 265) where X is chi-square with 100 degrees of freedom. Many packages, "Package X3" included, will compute only P(X ≤ 265), so p must be determined by complementation, p = 1 − P(X ≤ 265), which produces p = 0.11102E-15 rather than the correct p = 7.2119E-17. The reason is that P(X ≤ 265) is very close to unity, and so the subtraction necessary to obtain p is contaminated by cancellation error to the point that the result has no accurate digits. Packages should compute upper and lower tails separately.
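The cancellation is easy to exhibit in any environment that offers both the lower tail and a direct upper-tail (survival) function; the sketch below uses scipy for the chi-square example just given.

# Cancellation error from computing an upper tail by complementation.
from scipy import stats

direct = stats.chi2.sf(265, 100)              # upper tail computed directly
complemented = 1.0 - stats.chi2.cdf(265, 100)

print(direct)        # approximately 7.2e-17, the correct order of magnitude
print(complemented)  # roughly 1.1e-16 (or 0.0), with no correct digits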

7. Conclusions and Recommendations

We have presented several cases of serious numerical discrepancies between econometric packages, including FIML, GARCH, (partial) autocorrelation, and Cochrane-Orcutt corrections. We have suggested that inadequate nonlinear solvers are not uncommon. There can be no doubt that many more such discrepancies exist, and they can be attributed in part to a lack of benchmarks. We have also shown that random number generators and statistical distributions can suffer from numerical problems. These problems can be solved only by a concerted effort on the part of users, developers, and the economics profession as a whole.

Users should recognize that accuracy is at least as important as either speed or user-friendliness. This does not mean each user must immediately visit the StRD website and commence benchmarking his package. At the very least it does imply that users should be cognizant of the fact that writing accurate software is more demanding than writing either fast or friendly software. Developers need to benchmark their software. This does not mean developers should suspend implementation of new procedures and devote all their resources to developing needed benchmarks. At the very least, it means that developers should make use of existing benchmarks, such as the StRD, TESTMAT, and others which will be developed in the future. When a benchmark exists for a procedure, as in the case of FIML, the developer should note in the manual that the procedure achieves the benchmark, or else explain why it does not. Moreover, developers need to document their procedures better. As Lovell and Selover (1994) discovered in their consideration of AR(1) procedures, there are many admissible implementations of such procedures, and developers often do not describe precisely which implementation they use. Documentation for econometric software can only be improved by paying proper attention to numerical and algorithmic matters. RNGs need to be documented, including the type of RNG, its period, and the tests that the developer's implementation of the RNG has passed. Statistical distributions also need to be better documented, and to have their accuracy verified.

Accurate econometric software is not just the responsibility of the developers; it requires active participation by the profession. Obviously, creating benchmarks is beyond the capability of any one developer, or even a small collection of developers. So, too, is the matter of deciding which procedures are most in need of being benchmarked. A strong suggestion that a method needs to be benchmarked is that two packages give two different answers to the same problem.18 At present, two different packages are not often used to solve the same problem, but the widespread use of archives by journals would change this. Therefore, journal editors should require that authors identify their software (including version number) and make their code and their data widely available via archives. In addition to uncovering discrepancies between packages, this will give developers and users more of an incentive to rely upon benchmarked procedures, and thus attenuate the problem of two packages providing two different answers to the same problem.

Why journal readers do not demand archives is the same issue as why software users do not demand evidence of accuracy. Yet, just as results from a software package that passes benchmark tests are preferable to results from a software package that fails benchmark tests, so, too, studies from a journal whose results are capable of being verified are preferable to studies from a journal whose results are not verifiable.

Regrettably, neither benchmarking by vendors nor archiving by journals is yet a common practice.

Two related events are worth noting. First, following the publication of Dewald, Thursby, and Anderson (1986), the Journal of Money, Credit and Banking began requesting data from its authors, and the NSF established an archive for the storage and distribution of authors' data at the University of Michigan's Interuniversity Consortium for Political and Social Research. The NSF economics program invited journal editors to request that authors place their data in this archive; 22 editors declined this invitation (Anderson and Dewald 1994). Second, in 1993, the JMCB discontinued its practice of requesting data from authors. However, both events occurred before the advent of the worldwide web.

While some journals have "policies" that authors should make available their data and code, there is no penalty attached to an author's refusal to comply, and these policies are honored more often in the breach. The results of Dewald, Thursby, and Anderson (1986), recently revisited by Anderson and Dewald (1994), showed that most authors could not or would not honor a request for data and code. The disincentives for authors to comply with such requests are discussed in Susan Feigenbaum and David Levy (1993, 1994). An archive completely eliminates the problem of obtaining data and code from authors. The MD archive, in fact, requires that the code supplied be able to reproduce the reported results. This latter requirement is necessary to ensure the replicability of research, but it is also necessary for the problem of determining whether two packages really do produce different answers to the same problem. A reported discrepancy between two packages might be due to a coding error, but this cannot be determined if the code is not available.

18 This does not necessarily mean that either package is incorrect. It may mean only that the documentation is poor and the two packages are really doing two different things.

That the methods by which results are obtained be open to the scrutiny of other researchers is a higher standard of quality than is usually found in economics, but it is a common standard in other sciences. Economic research would generally benefit if this standard were more widely adopted, and so would software.19 We observe that some authors maintain this higher standard by making use of personal archives when they publish in journals that have no archive. To give but three examples, the articles by Bronwyn Hall (1993), James Hamilton (1997), and Mark Watson (1993) all have data and code archived at the authors' homepages.

19 The journals themselves might also benefit. Tauchen (1993) has argued that readers of journals should be interested in data and code, and that when they are made available a journal's prestige and circulation should increase.

Until journal archives are commonplace, many things can be done. Researchers should ensure that their software is up to the task of producing replicable research. Referees can ask authors to provide computational details, with an eye to making sure that results are reproducible. Software review editors can do at least two things. First, when a developer provides no evidence that his software meets existing benchmarks, the editor can suggest that the reviewer apply known benchmarks. Second, software review editors can encourage reviewers to propose new benchmarks. The more quantitative journals can devote space to publication of more sophisticated benchmarks or to discussion of software; for example, the Journal of Economic and Social Measurement has a special issue on econometric software forthcoming.

Reliable econometric software is the joint responsibility of the users, developers, and the profession. We hope the day is not too far off when the advertisements for econometric software read, "Obtains the correct solution in x seconds," rather than the current, "Obtains a solution in y seconds," and that users will appreciate this distinction, even though x may be greater than y.

REFERENCES

Abramowitz, Milton and Irene A. Stegun. 1972. Handbook of Mathematical Functions. 9th printing. NY: Dover.

Anderson, Richard G. and William G. Dewald. 1994. "Scientific Standards in Applied Economics a Decade After the Journal of Money, Credit and Banking Project," Fed. Res. Bank St. Louis Rev., 76, pp. 79–83.

Bablok, B. 1988. Numerische Berechnung nichtzentraler statistischer Verteilungen. Diploma thesis, Dept. Statistics, U. Munich.

Bankhofer, Udo and Andreas Hilbert. 1997. "Statistical Software for Windows: A Market Survey," Statist. Pap., 38, pp. 393–407.

Bard, Jonathan. 1974. Nonlinear Parameter Estimation. NY: Academic Press.

Beaton, Albert E.; Donald B. Rubin, and John L. Barone. 1976. "The Acceptability of Regression Solutions: Another Look at Computational Accuracy," J. Amer. Statist. Assoc., 71:353, pp. 158–68.

Bernhard, G.; M. Herbold, and W. Meyers. 1988. "Investigation on the Reliability of Some Elementary Nonparametric Methods in Statistical Analysis Systems," Statist. Software Newsletter, 14, pp. 19–26.

Berndt, Ernst. 1990. The Practice of Econometrics. Reading, MA: Addison-Wesley.

Boisvert, R.; R. Pozo, K. Remington, R. Barrett, and J. Dongarra. 1997. "Matrix Market: A Web Resource for Test Matrix Collections," in The Quality of Numerical Software: Assessment and Enhancement. Ronald Boisvert, ed. London: Chapman and Hall, pp. 125–37. (http://math.nist.gov/MatrixMarket)

Bollerslev, Tim and Eric Ghysels. 1996. "Periodic Autoregressive Conditional Heteroscedasticity," J. Bus. Econ. Statist., 14:2, pp. 139–51.

Box, George E. P. and Gwilym Jenkins. 1976. Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day.

Brown, Barry W. 1998. DCDFLIB v1.1 (Double Precision Cumulative Distribution Function LIBrary). (ftp://odin.mdacc.tmc.edu/pub/source)

Brown, Barry W. and Lawrence B. Levy. 1994. "Certification of Algorithm 708: Significant Digit Computation of the Incomplete Beta," ACM Transactions on Math. Software, 20:3, pp. 393–97.

Campbell, John Y. and N. Gregory Mankiw. 1987. "Are Output Fluctuations Transitory?" Quart. J. Econ., 102, pp. 857–80.

Calzolari, Giorgio and Lorenzo Panattoni. 1988. "Alternative Estimators of FIML Covariance Matrix: A Monte Carlo Study," Econometrica, 56:3, pp. 701–14.

Chaitin-Chatelin, Francoise and Valérie Frayssé. 1996. Lectures on Finite Precision Computations. Philadelphia: SIAM.

Coveyou, R. and R. MacPherson. 1967. "Fourier Analysis of Uniform Random Number Generators," J. ACM, 14:1, pp. 100–19.

Dahlquist, Germund and Ake Björck. 1974. Numerical Methods. New York: Prentice-Hall.

Davidson, Russell and James G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford.

Demmel, James. 1981. "Effects of Underflow on Solving Linear Systems," manuscript, Computer Science Div., U. C. Berkeley.

———. 1984. "Underflow and the Reliability of Numerical Software," SIAM J. Sci. Statist. Computing, 5:4, pp. 887–919.

Dennis, J. E. and Robert B. Schnabel. 1996. Numerical Methods for Unconstrained Optimization. Philadelphia: SIAM Press.

Dewald, William G.; Jerry G. Thursby, and Richard G. Anderson. 1986. "Replication in Empirical Economics: The Journal of Money, Credit and Banking Project," Amer. Econ. Rev., 76:4, pp. 587–603.

Dongarra, J. J. and G. W. Stewart. 1984. "LINPACK––Package for Solving Linear Systems," in Sources and Development of Mathematical Software. W. Cowell, ed. NY: Prentice-Hall, pp. 20–48.

Dudewicz, E. and T. Ralley. 1981. The Handbook of Random Number Generation and Testing with TESTRAND Computer Code. Columbus, OH: American Sciences Press.

Dwyer, Jerry and K. B. Williams. 1996. "Testing Random Number Generators," C/C++ Users J., June, pp. 39–48.

Eddy, W. F.; S. E. Howe, B. F. Ryan, R. F. Teitel, and F. Young. 1991. The Future of Statistical Software: Proceedings of a Forum. Wash., DC: Nat. Academy Press.

Elliott, Alan C.; Joan S. Reisch, and Nancy P. Campbell. 1989. "Benchmark Data Sets for Evaluating Microcomputer Statistical Programs," Collegiate Microcomputer, 7:4, pp. 289–99.

Feigenbaum, Susan and David M. Levy. 1993. "The Market for (Ir)Reproducible Econometrics," Soc. Epistemology, 7:3, pp. 243–44.

———. 1994. "The Self-Enforcement Mechanism in Science," manuscript, presented at the AEA meetings, Boston.

Fiorentini, Gabriele; Giorgio Calzolari, and Lorenzo Panattoni. 1996. "Analytic Derivatives and the Computation of GARCH Estimates," J. Appl. Econometrics, 11:4, pp. 399–417.

Ford, B. and J. Rice. 1994. "Rationale for the IFIP Working Conference: The Quality of Numerical Software: Assessment and Enhancement," IFIP–WG2.5 doc., Raleigh.

Forsythe, George E. 1970. "Pitfalls in Computation, or Why a Math Book Isn't Enough," Amer. Math. Monthly, 77, pp. 931–56.

Fox, L. 1971. "How To Get Meaningless Answers in Scientific Computation (and What To Do about It)," IMA Bul., 7:10, pp. 296–302.

Francis, Ivor. 1981. Statistical Software: A Comparative Review. NY: North-Holland.

———. 1983. "A Survey of Statistical Software," Computational Statist., 1:1, pp. 17–27.

Francis, Ivor; Richard M. Heiberger, and Paul F. Velleman. 1975. "Criteria and Considerations in the Evaluation of Statistical Software," Amer. Statistician, 29:1, pp. 52–56.

Gentle, James E. 1998. Numerical Linear Algebra with Applications in Statistics. NY: Springer.

———. 1998a. Random Number Generation and Monte Carlo Methods. NY: Springer.

Geweke, John. 1996. "Monte Carlo Simulation and Numerical Integration," in Handbook of Computational Economics, vol. 1. Hans M. Amman, David A. Kendrick, and John Rust, eds. Amsterdam: North-Holland, pp. 731–800.

Geweke, John and Michael Keane. 1997. "An Empirical Analysis of Income Dynamics Among Men in the PSID," Staff Rep. 233, Fed. Res. Bank Minneapolis.

Goldberg, David. 1991. "What Every Computer Scientist Should Know About Floating-Point Arithmetic," ACM Computing Surveys, 23:1, pp. 5–48.

Goldberg, I. B. 1967. "27 Bits Are Not Enough for 8-Digit Accuracy," Communications of the ACM, 10:2, pp. 105–106.

Gourieroux, Christian and Alain Montfort. 1996. Simulation Based Methods in Econometrics. NY: Oxford U. Press.

Greene, William H. 1997. Econometric Analysis, 3e. NY: MacMillan.

Hall, Bronwyn. 1993. "The Stock Market's Valuation of Research and Development Investment During the 1980s," Amer. Econ. Rev., 83:2, pp. 259–64.

Hamilton, James D. 1997. "Measuring the Liquidity Effect," Amer. Econ. Rev., 87:1, pp. 80–97.

Hammarling, Sven. 1985. "The Singular Value Decomposition in Multivariate Statist.," ACM Special Interest Group on Numerical Math., 20, pp. 2–25.

Higham, Nicholas J. 1991. "Algorithm 694: A Collection of Test Matrices in MATLAB," ACM Transactions on Math. Software, 17:3, pp. 289–305.

———. 1995. "The Test Matrix Toolbox for MATLAB v3.0," Numerical Analysis Rep. 276, U. Manchester, England. (http://www.ma.man.ac.uk/~higham/testmat.html)

———. 1996. Accuracy and Stability of Numerical Algorithms. Philadelphia: SIAM.

IBM. 1968. System/360 Scientific Subroutine Package, Version III, Programmer's Manual. White Plains, NY.

IEEE. 1985. Standard for Binary Floating-PointArithmetic, ANSI/IEEE Standard 754–1985. NY:Inst. Electrical and Electronics Engineers. Re-printed in SIGPLAN Notices, 1987, 22, pp. 9–25.

Johnson, Norman L.; Samuel Kotz, and N. Balakrishnan. 1994. Continuous Univariate Distributions, vol. 1, 2e. NY: Wiley Interscience.

———. 1995. Continuous Univariate Distributions, vol. 2, 2e. NY: Wiley Interscience.

Judd, Kenneth L. 1998. Numerical Methods in Economics. Cambridge, MA: MIT Press.

Kahan, William. 1997. “The Baleful Effect of Computer Languages and Benchmarks upon Applied Mathematics, Physics and Chemistry,” John von Neumann Lecture, 45th Annual Meeting of SIAM.

Kennedy, William J. and James E. Gentle. 1980. Statistical Computing. NY: Marcel-Dekker.

Knüsel, Leo. 1989. Computergestützte Berechnung Statistischer Verteilungen. Oldenbourg, München-Wien. (English versions at www.stat.uni-muenchen.de/~knuesel/elv).

———. 1995. “On the Accuracy of the Statistical Distributions in GAUSS,” Computational Statist. Data Analysis, 20:6, pp. 699–702.

———. 1996. “Telegrams,” Computational Statist. Data Analysis, 21:1, p. 116.

———. 1998. “On the Accuracy of Statistical Distributions in Microsoft Excel,” Computational Statist. Data Analysis, 26:3, pp. 375–77.

Knuth, Donald E. 1997. The Art of Computer Programming, 3e. Reading, MA: Addison-Wesley.

Lachenbruch, P. A. 1983. “Statistical Programs for Microcomputers,” Byte, 8:8, pp. 560–70.

L’Ecuyer, Pierre. 1992. “Testing Random Number Generators,” in Proceedings of the 1992 Winter Simulation Conference. J. J. Swain, D. Goldsmith, R. C. Crain, and J. R. Wilson, eds. NY: IEEE Press, pp. 305–13.

———. 1994. “Uniform Random Number Generation,” Annals of Operations Research, 53, pp. 77–120.

Lesage, James P. and Stephen D. Simon. 1985. “Numerical Accuracy of Statistical Algorithms for Microcomputers,” Computational Statist. Data Analysis, 3:1, pp. 47–57.

Letson, David and B. D. McCullough. 1998. “Better Confidence Intervals: The Double Bootstrap with No Pivot,” Amer. J. Agr. Econ., 80:3, pp. 552–59.

Ling, Robert F. 1974. “Comparison of Several Algorithms for Computing Sample Means and Variances,” J. Amer. Statist. Assoc., 69:348, pp. 859–66.

Longley, James W. 1967. “An Appraisal of Least Squares Programs for the Electronic Computer from the Point of View of the User,” J. Amer. Statist. Assoc., 62:319, pp. 819–41.

Lovell, Michael C. and David D. Selover. 1994. “Econometric Software Accidents,” Econ. J., 104, pp. 713–26.

MacKie-Mason, Jeffrey K. 1992. “Econometric Software: A User’s View,” J. Econ. Perspectives, 6:4, pp. 165–88.

MacKinnon, James G. 1996. “Numerical Distribution Functions for Unit Root and Cointegration Tests,” J. Appl. Econometrics, 11:6, pp. 601–18.

Marsaglia, George. 1968. “Random Numbers Fall Mainly in the Planes,” Proceedings Nat. Academy Sciences USA, 60, pp. 25–28.

———. 1985. “A Current View of Random Number Generators,” in Computer Science and Statistics: 16th Symposium on the Interface. L. Billard, ed. Amsterdam: North-Holland, pp. 3–10.

———. 1993. “Monkey Tests for Random Number Generators,” Computers and Math. with Applications, 26:9, pp. 1–10.

———. 1996. DIEHARD: A Battery of Tests of Randomness. http://stat.fsu.edu/pub/diehard.

McCullough, B. D. 1997. “Benchmarking Numerical Accuracy: A Review of RATS v4.2,” J. Appl. Econometrics, 12:2, pp. 181–90.

———. 1998. “Assessing the Reliability of Statistical Software: Part I,” Amer. Statistician, 52:4, pp. 358–66.

———. 1998a. “Algorithm Choice for (Partial) Autocorrelation Functions,” J. Econ. Soc. Meas., forthcoming.

———. 1999. “Assessing the Reliability of Statistical Software: Part II,” Amer. Statistician, forthcoming.

———. 1999a. “Econometric Software Reliability: EViews, LIMDEP, TSP and SHAZAM,” J. Appl. Econometrics, forthcoming.

———. 1999b. “Wilkinson’s Tests and Econometric Software,” J. Econ. Soc. Meas., forthcoming.

McCullough, B. D. and Charles G. Renfro. 1999. “Benchmarks and Software Standards: A Case Study of GARCH Procedures,” J. Econ. Soc. Meas., forthcoming.

McCullough, B. D. and H. D. Vinod. 1998. “Implementing the Double Bootstrap,” Computational Econ., 12:1, pp. 79–95.

McCullough, B. D. and Berry Wilson. 1999. “On the Accuracy of Statistical Procedures in EXCEL97,” Computational Statist. Data Analysis, forthcoming.

McKenzie, Colin. 1998. “A Review of Microfit 4.0,” J. Appl. Econometrics, 13:1, pp. 77–89.

Murray, W. 1972. “Failure, the Causes and Cures,” in Numerical Methods for Unconstrained Optimization. W. Murray, ed. NY: Academic Press, pp. 107–22.

Newbold, Paul; Christos Agiakloglou, and John Miller. 1994. “Adventures with ARIMA Software,” Int. J. Forecasting, 10:4, pp. 573–81.

Park, Stephen K. and Keith W. Miller. 1988. “Random Number Generators: Good Ones Are Hard to Find,” Communications of the ACM, 31:10, pp. 1192–201.

Press, William H.; Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1994. Numerical Recipes in Fortran, 2e. NY: Cambridge U. Press.

Priestley, M. B. 1981. Spectral Analysis and Time Series. London: Academic Press.

Renfro, Charles. 1997. “Normative Considerations in the Development of a Software Package for Econometric Estimation,” J. Econ. Soc. Meas., 23, pp. 277–330.

Ripley, B. D. 1990. “Thoughts on Pseudorandom Number Generators,” J. Computational Appl. Math., 31:1, pp. 153–63.

Rogers, Janet; James Filliben, Lisa Gill, William Guthrie, Eric Lagergren, and Mark Vangel. 1998. “StRD: Statistical Reference Datasets for Testing the Numerical Accuracy of Statistical Software,” NIST# 1396, Nat. Inst. Standards and Tech.

Sawitzki, Günther. 1985. “Another Random Number Generator Which Should Be Avoided,” Statist. Software Newsletter, 11, pp. 81–82.

———. 1994. “Testing Numerical Reliability of Data Analysis Systems,” Computational Statist. Data Analysis, 18:2, pp. 269–86.

———. 1994a. “Report on the Numerical Reliability of Data Analysis Systems,” Computational Statist. Data Analysis (SSN), 18:2, pp. 289–301.

Silk, Julian. 1996. “System Estimation: A Comparison of SAS, SHAZAM, and TSP,” J. Appl. Econometrics, 11:4, pp. 437–50.

Simon, Stephen D. and James P. Lesage. 1988. “Benchmarking Numerical Accuracy of Statistical Algorithms,” Computational Statist. Data Analysis, 7:2, pp. 197–209.

Stephens, Michael A. 1986. “Tests for the Uniform Distribution,” in Goodness-of-Fit Techniques. R. D’Agostino and M. Stephens, eds. NY: Marcel-Dekker, pp. 331–66.

Stuart, Alan and J. Keith Ord. 1991. Kendall’s Advanced Theory of Statistics, vol. 2. NY: Oxford U. Press.

Tauchen, George. 1993. “Remarks on My Term at JBES,” J. Bus. Econ. Statist., 11:4, pp. 426–31.

Tezuka, Shu. 1995. Uniform Random Numbers: Theory and Practice. Boston: Kluwer.

Thisted, Ronald A. 1988. Elements of Statistical Computing. NY: Chapman and Hall.

Tjostheim, Dag and Jostein Paulsen. 1983. “Bias of Some Commonly Used Time Series Estimates,” Biometrika, 70:2, pp. 389–99.

Vandergraft, J. S. 1983. Introduction to Numerical Computations. Orlando, FL: Academic Press.

Veall, Michael R. 1991. “Shazam 6.2: A Review,” J. Appl. Econometrics, 6:3, pp. 317–20.

Vinod, H. D. 1982. “Enduring Regression Estimator,” in Time Series Analysis: Theory and Practice 4. O. D. Anderson, ed. Amsterdam: North-Holland, pp. 397–416.

———. 1989. “A Review of Soritec 6.2,” Amer. Statistician, 43:4, pp. 266–69.

———. 1993. “Bootstrap, Jackknife, and Resampling Methods: Applications in Econometrics,” in Handbook of Statistics: Econometrics, vol. 11. G. S. Maddala, C. R. Rao, and H. D. Vinod, eds. NY: North-Holland, pp. 629–61.

———. 1997. “Using Godambe-Durbin Estimating Functions in Econometrics,” in Selected Proceedings of the Symposium on Estimating Functions. I. Basawa, V. P. Godambe, and R. Taylor, eds. IMS Lecture Notes Monograph Series, 32, pp. 215–37.

Vinod, H. D. and A. Ullah. 1981. Recent Advances in Regression Methods. NY: Marcel-Dekker.

Vinod, H. D. and L. R. Shenton. 1996. “Exact Moments for Autoregressive and Random Walk Models for a Zero or Stationary Initial Value,” Econometric Theory, 12:3, pp. 481–99.

Wampler, Roy H. 1980. “Test Procedures and Test Problems for Least Squares Algorithms,” J. Econometrics, 12:1, pp. 3–22.

Watson, Mark W. 1993. “Measures of Fit for Calibrated Models,” J. Polit. Econ., 101:6, pp. 1011–41.

Wilkinson, J. H. 1963. Rounding Errors in Algebraic Processes. Englewood Cliffs, NJ: Prentice-Hall.

Wilkinson, Leland. 1985. Statistics Quiz. Evanston, IL: SYSTAT, Inc. (at http://www.tspintl.com/benchmarks).

———. 1994. “Practical Guidelines for Testing Statistical Software,” in Computational Statistics. P. Dirschedl and Rüdiger Ostermann, eds. Berlin: Physica-Verlag, pp. 111–24.

Wilkinson, Leland and Gerard E. Dallal. 1977. “Accuracy of Sample Moments Calculations Among Widely Used Statistical Programs,” Amer. Statistician, 31:3, pp. 128–31.
