

Rural Sociology 56(3), 1991, pp. 461-474
Copyright © 1991 by the Rural Sociological Society

RESEARCH NOTE

On the Utility of Robust and Resampling Procedures¹

Thomas Dietz, Linda Kalof,* and R. Scott Frey**
Department of Sociology and Anthropology, George Mason University, Fairfax, Virginia 22030
*Department of Sociology, State University of New York, Plattsburgh, New York 12901
**Department of Sociology, Anthropology, and Social Work, Kansas State University, Manhattan, Kansas 66506

ABSTRACT Kassab (1990) makes an important methodological contribution by urging the use of robust regression methods in the study of community economic impacts and by indicating the utility of the bootstrap in assessing standard errors in robust regression. By introducing the notion of a contaminating distribution, we reconcile differences between her claim that ordinary least squares (OLS) regression is biased when outliers are present and standard linear model theory, which makes no assumptions about the shape of the residual distribution in proving OLS an unbiased estimator. The contaminating distribution provides a framework for rural sociologists to link their statistical assumptions to a substantive understanding of the phenomena being studied. We suggest an alternative regression estimation strategy that may be more robust than the technique she uses. We also discuss an approach to bootstrapping that is more appropriate for macro-level social indicator data than the one she describes. An appendix discusses the software available for implementing these methods.

Introduction

Kassab (1990) provides a clear argument that robust and resampling procedures are more appropriate than traditional ordinary least squares (OLS) in estimating impacts of changing economic structure on communities. We agree with her position and, in this note, expand upon and reinforce her discussion, which we believe is applicable not only to studies of community impacts but to all research making use of macro-level social units of analysis, such as communities, nation-states, or counties. Over the last decade, a number of articles appearing in Rural Sociology have used this type of data, so the argument for robust and resampling methods is of general relevance to rural sociologists.

¹ We thank Cathy Kassab and the Rural Sociology reviewers for their comments on this paper, the Computer Center at the State University of New York at Plattsburgh for computing assistance, and the STATA technical staff for their help with software. The research was supported in part by the National Science Foundation grant number SES-9109928.


462 Rural Sociology, Vol. 56, No.3, Fall 1991

We wish to make three general points. First, we consider the estimation properties of OLS and robust approaches and resolve some apparent contradictions between Kassab's assertions regarding bias in ordinary least squares regression coefficients and the results of standard linear model theory. To clarify the properties of various estimation methods, we sketch a model for estimation with macro-level indicators that invokes the idea of the contaminating distribution. This places the rationale for robust estimation in a conceptual framework that is accessible to users of traditional regression methods. Second, we note a potential weakness of the procedure she uses to obtain robust estimates and suggest some alternatives to it. Third, we examine some alternative strategies for bootstrapping regressions and illustrate these points with a brief example. Appendix 2 describes computer software for using the suggested procedures. Our general goal is to expand Kassab's discussion so as to facilitate the understanding and use of these new and important methods by rural sociologists.

Estimation properties of OLS and robust regression estimators

Kassab (1990:358) suggests that OLS estimates of regression parameters can be biased if outliers are present. Yet standard linear model proofs demonstrate that OLS estimates will be unbiased as long as the residuals (1) are not correlated with the independent variables and (2) have an expected value of zero. Such proofs also demonstrate that OLS will have minimum variance among linear unbiased estimators if, in addition to (1) and (2), (3) the independent variables are non-stochastic, (4) the residuals are not correlated with themselves across observations (no autocorrelation), and (5) the residuals have constant variance (homoscedasticity) (Hampel et al. 1986:308-11; Zellner 1983:129-32). All these properties can be established without any assumption about the normality of residuals. But a proof that OLS has minimum variance among the class of all unbiased estimators (including non-linear estimators such as the robust approaches considered here) requires that the residuals be normally distributed and that the five assumptions above be met.
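The unbiasedness result can be sketched in matrix form. The notation is ours and is introduced only for illustration: X is the fixed matrix of independent variables, β the vector of true coefficients, e the vector of residuals, and b the OLS estimate.

```latex
\begin{aligned}
y &= X\beta + e, \qquad b = (X'X)^{-1}X'y \\
b &= (X'X)^{-1}X'(X\beta + e) = \beta + (X'X)^{-1}X'e \\
E[b] &= \beta + (X'X)^{-1}X'\,E[e] = \beta \quad \text{when } X \text{ is fixed and } E[e] = 0
\end{aligned}
```

No step in this argument refers to the shape of the distribution of e, which is why unbiasedness does not depend on normality.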

How can the difference between Kassab's assertion of bias and these standard proofs be reconciled? The differences result from the use of different conceptual models of the sampling process in regression. Traditional estimation theory considers the values of the independent variables, collectively called the design matrix, the model matrix, or the carriers (depending on the author), as fixed and both the residuals and the independent variables as representative of the causal regime under study. Under these assumptions, OLS is robust with regard to bias, but its efficiency as an estimation procedure is sensitive to the shape of the residual distribution. Even relatively small deviations from normality can produce a substantial reduction in efficiency, so that in small samples, estimates of regression coefficients will be quite variable, and standard errors can have considerable bias (Dietz et al. 1987).

Most robust estimation theory employs a different conceptual model (e.g., Rousseeuw and Leroy 1987; Hampel et al. 1986; Li 1985; Wu 1985; see Dietz et al. 1987 for an explication). Under the simplest model, it is assumed that the independent variables are fixed (will not vary from sample to sample in a sampling experiment) and that the majority of them are representative of the causal regime under study. It is further assumed that a majority of residuals are sampled from that causal regime. But a fraction (up to 50%) of the observations are sampled from a different causal regime referred to as a contaminating distribution. Contamination of the independent variables, the residuals, or both can occur. Estimates of the parameters of the causal regime of interest are derived from the full set of observations, including the legitimate observations and those from the contaminating distribution. Such estimates are biased because they mix "good" and "bad" observations.
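A small simulation can make the contamination argument concrete. The sketch below is illustrative only: the two "causal regimes" (slopes of 2 and -1), the sample size, and the 20% contamination rate are hypothetical values chosen for the demonstration, not quantities from Kassab's study or from this note.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients for a design matrix X that includes an intercept column."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def simulate_mean_slope(frac_contaminated, n=50, reps=500, seed=0):
    """Average OLS slope over repeated samples with fixed carriers."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 10.0, n)            # carriers held fixed across samples
    X = np.column_stack([np.ones(n), x])
    n_bad = int(round(frac_contaminated * n))
    slopes = np.empty(reps)
    for r in range(reps):
        # Regime of interest (hypothetical): y = 1 + 2x + noise
        y = 1.0 + 2.0 * x + rng.standard_normal(n)
        if n_bad:
            # Contaminating regime (hypothetical): y = 1 - x + noise
            y[-n_bad:] = 1.0 - x[-n_bad:] + rng.standard_normal(n_bad)
        slopes[r] = ols(X, y)[1]
    return slopes.mean()

clean = simulate_mean_slope(0.0)
dirty = simulate_mean_slope(0.2)
print(f"mean OLS slope, no contamination:  {clean:.2f}")
print(f"mean OLS slope, 20% contamination: {dirty:.2f}")
```

Without contamination the average slope sits close to the regime-of-interest value of 2 even though no normality is required for that result; with contamination the average moves well away from 2, which is bias in the sense used by Kassab.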

This suggests that the researcher must think carefully about the process that generated the observations under study. One view implicit in Kassab's discussion is that the observations are generated by both the causal regime of interest and a contaminating distribution. In her substantive example, it may be that some communities in the data set experience a radically different relationship between employment mix and income than the majority of communities and that these differences are not captured by the model being estimated. Robust procedures allow the researcher to recover estimates of the causal forces affecting the majority of communities. Though Kassab did not emphasize this point, robust procedures also allow the identification of outliers, that is, the observations that are not part of the general causal regime (Rousseeuw and Leroy 1987). These outliers may prove of interest for further study, because an understanding of why outliers do not match the general theoretical model may aid in refining theory and model specification.

Robust estimation procedures are appropriate even when the researcher does not consider the data used for estimation as a mixture of observations from the population of interest and a contaminating distribution and, thus, does not suspect bias. The finite sample efficiency of OLS estimates depends on normality. Even small departures from normality can drastically reduce OLS efficiency. There is rarely any theoretical justification for assuming normality of the population residuals. Diagnostics of the sample residuals (Bollen and Jackman 1985) and formal tests of normality (Royston 1983) can lend credence to the case for population normality, but they are not definitive evidence (Atkinson 1986; Royston 1991). We have elsewhere argued that non-normality may be the rule rather than the exception in studies using macro-level social indicator data (Dietz et al. 1987:383). And Levy (1984) has demonstrated that many generalized least squares procedures used to correct for autocorrelation and heteroscedasticity transform normal residuals into long-tailed distributions. Thus, we believe that robust procedures should be used routinely.

Alternative robust estimators

Most early discussions of robust regression focused on the problem of outliers in the residuals. Popular robust methods such as biweighted least squares (BLS) and least absolute deviations (LAD) were developed to provide good estimates (representative of the majority of data points and relatively efficient) when the residual distribution contains outliers (Dietz et al. 1987). BLS gives low weights to outliers. LAD minimizes the sum of the absolute deviations of the residuals, in contrast to the ordinary least squares approach, which minimizes the sum of the squared residuals. If the only problem in residuals is outliers, even relatively simple diagnostic techniques can help identify the problem (Bollen and Jackman 1985). But unusual values of the independent variables (values that can be thought of as coming from a causal regime different from that under study) may occur without producing large residuals. If the unusual values of the independent variables have the right configuration (a particularly perverse one from the analyst's point of view), they will have great influence on the estimated coefficients and can have very small residuals (see Rousseeuw and Leroy 1987:11 for an example in two dimensions). Our traditional tool, ordinary least squares, is very sensitive to outliers in either the residuals or the independent variables. BLS and LAD are robust with regard to outliers in the residuals, but can also fall victim to outliers in the independent variables.
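The idea of giving low weights to cases with large residuals can be sketched as an iteratively reweighted least squares loop. This is a minimal illustration, not the BLS implementation used in the literature cited here: the Tukey biweight function, the tuning constant 4.685, the MAD-based scale, and the OLS starting values are conventional choices that we assume for the example, and the data are simulated.

```python
import numpy as np

def biweight_regression(X, y, c=4.685, n_iter=50, tol=1e-8):
    """Iteratively reweighted least squares with Tukey biweight weights.

    X must include an intercept column. Returns (coefficients, final weights).
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]    # start from OLS (assumption)
    w = np.ones(len(y))
    for _ in range(n_iter):
        r = y - X @ beta
        # Robust residual scale: median absolute deviation, rescaled
        scale = np.median(np.abs(r - np.median(r))) / 0.6745
        if scale == 0:
            break
        u = r / (c * scale)
        # Biweight: weight falls smoothly to zero as |u| approaches 1
        w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)
        sw = np.sqrt(w)
        new_beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta, w

# Simulated data: true slope 2, with four gross outliers in the residuals
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
X = np.column_stack([np.ones(40), x])
y = 1 + 2 * x + 0.5 * rng.standard_normal(40)
y[30:34] += 30
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_bls, weights = biweight_regression(X, y)
print("OLS slope:", beta_ols[1], " biweight slope:", beta_bls[1])
```

Because the weights depend only on residuals, this kind of estimator offers no protection against unusual values of the independent variables that produce small residuals, which is the weakness described above.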

Kassab uses a robust regression procedure developed by Sheather and Hettmansperger (1987). The procedure is one of a popular family of robust estimators called generalized-M (GM) estimators. M-estimators were developed to provide a maximum likelihood approach to robust estimation (Hampel et al. 1986:100-10). The M-estimates weight cases according to the size of their residuals. BLS is an example. They are reasonably efficient when the residual distribution is normal and are not greatly influenced by outliers in the residuals. But, as noted above, they are vulnerable to unusual values in the carriers. GM-estimators were developed to retain the desirable properties of M-estimators, but also provide some protection against outliers in the carriers (Rousseeuw and Leroy 1987:12-13). They use measures of the influence of an observation on the estimated coefficients to identify unusual data points. Observations that are highly influential receive lower weights than observations that have average influence, so that unusual observations have relatively little impact on the weighted estimates. Because the GM-estimators limit the influence of any one observation on the parameter estimates, they are also called bounded influence estimators. If we assume that the outliers (which may occur in either the independent variables or the residuals) are from a causal regime different from the one of interest, the GM-estimates try to recover the "true" parameters by giving little weight to observations from the contaminating distribution, whether the contamination is in residuals or in carriers or both.

The GM-estimator used by Kassab protects against outliers in both residuals and independent variables by developing weights based on the DFFITS diagnostic (Belsley et al. 1980), which measures how much deleting a case from the analysis would change the regression coefficients or, equivalently, the fit of the equation (see, for example, Rousseeuw and Leroy 1987:227-29). Since DFFITS takes account of outliers in both residuals and independent variables, the resulting estimate should be robust with regard to either type of contamination. This has made DFFITS popular as a diagnostic tool even when robust estimation is not used (Bollen and Jackman 1985).

Unfortunately, DFFITS assesses the influence of an observation by measuring how much estimates would change if that single observation were dropped. A data set can contain several unusual cases that mask each other when viewed one at a time (Atkinson 1986). If this seems a remote and even perverse possibility, note that in two out of the three data sets examined by Rousseeuw and Leroy (1987:229-34) exactly such problems occur and DFFITS does not reveal the problematic cases. Thus DFFITS has some vulnerability both as a diagnostic and as a basis for a GM-estimator.
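To make the single-case-deletion logic explicit, the following sketch computes DFFITS for an OLS fit from the leverages and the externally studentized residuals, using the standard closed form so that no model needs to be refit. The function name and the simulated data are ours; this is not the weighting scheme of Sheather and Hettmansperger, only the diagnostic on which such weights are based.

```python
import numpy as np

def dffits(X, y):
    """DFFITS for each case of an OLS fit; X must include an intercept column.

    DFFITS_i is the change in the fitted value for case i when case i is
    deleted, scaled by the deleted-case standard error of that fitted value.
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverages (hat-matrix diagonal)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    sse = e @ e
    # Residual variance re-estimated with case i deleted
    s2_deleted = (sse - e**2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_deleted * (1 - h))         # externally studentized residuals
    return t * np.sqrt(h / (1 - h))

# Simulated illustration
rng = np.random.default_rng(0)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
d = dffits(X, y)
print(np.round(d, 3))
```

Each value answers the question "what happens if this one case is removed?" When two or more unusual cases lie close together, removing any one of them leaves the others in place to hold the fit, so none of them need show a large value; this is the masking problem noted above.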

The least median of squares (LMS) procedure developed by Rousseeuw (Rousseeuw 1984; Rousseeuw and Leroy 1987) proves to be robust against outliers in residuals and outliers in independent variables, whether the outliers are single cases or groups of cases. LMS estimates minimize the median of the squared residuals rather than the sum of the squares (OLS) or the sum of the absolute deviations (LAD). Unfortunately, LMS, while highly robust, is not efficient. This leads us to suggest a multi-step strategy for robust estimation. OLS estimates should always be calculated, because they are traditional and are optimal if the residual distribution is actually normal and no contamination (outliers) is present. As a second step, LMS estimates provide a very robust set of coefficient estimates and diagnostics, but are inefficient. In a third step, the LMS estimates and diagnostics are used as the starting point for more efficient but less robust procedures (such as BLS or a GM-estimator like the one used by Kassab). Indeed, Rousseeuw and Leroy suggest that even simple one-step re-weighted OLS (RLS) using weights based on LMS is both efficient and robust.
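The second and third steps of this strategy can be sketched as follows. This is an approximation under stated assumptions, not the algorithm of any particular package: LMS is approximated by searching over randomly drawn p-point exact fits, and the reweighting step uses a scale factor of 1.4826(1 + 5/(n - p)) and a cutoff of 2.5, which we take as a conventional choice for LMS-based reweighting rather than as values reported in this note. The data are simulated, with a cluster of high-leverage contaminated cases.

```python
import numpy as np

def lms_fit(X, y, n_subsets=1000, seed=0):
    """Approximate least median of squares via random p-point exact fits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_crit = None, np.inf
    for _ in range(n_subsets):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])   # line/plane through p cases
        except np.linalg.LinAlgError:
            continue                                 # skip degenerate subsets
        crit = np.median((y - X @ beta) ** 2)        # LMS criterion
        if crit < best_crit:
            best_beta, best_crit = beta, crit
    return best_beta, best_crit

def rls_from_lms(X, y, cutoff=2.5, **lms_kwargs):
    """One-step reweighted least squares with 0/1 weights from LMS residuals."""
    n, p = X.shape
    beta_lms, crit = lms_fit(X, y, **lms_kwargs)
    scale = 1.4826 * (1 + 5 / (n - p)) * np.sqrt(crit)   # assumed convention
    keep = np.abs(y - X @ beta_lms) <= cutoff * scale     # weight 1 if kept, else 0
    beta_rls = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    return beta_lms, beta_rls, keep

# Simulated data: 28 cases follow y = 1 + 2x; 12 form a high-leverage cluster
rng = np.random.default_rng(2)
x = np.concatenate([np.linspace(0, 10, 28), 20 + rng.normal(0, 0.3, 12)])
y = np.concatenate([1 + 2 * x[:28] + 0.3 * rng.standard_normal(28),
                    5 + 0.3 * rng.standard_normal(12)])
X = np.column_stack([np.ones(40), x])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_lms, beta_rls, keep = rls_from_lms(X, y)
print("OLS slope:", beta_ols[1], " LMS slope:", beta_lms[1], " RLS slope:", beta_rls[1])
```

The cases with `keep` equal to `False` are the ones the LMS step flags as outliers; in applied work these are the observations to examine substantively before deciding how to treat them.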

Our general point is that there are several alternatives to the robust procedure used by Kassab. The best strategy is a mixed one. Because each of several robust estimators has strengths and weaknesses, it is possible to combine them into a highly robust and efficient strategy that is appropriate to the character of the macro-level social indicator data.

Bootstrapping strategies

The sampling properties of robust regression procedures are not well understood even in large samples, and the small sample theory for them is extremely limited. As a result, bootstrapping is becoming an increasingly popular strategy for assessing sampling variability of robust estimators (Dietz et al. 1987; Efron and Tibshirani 1985). Bootstrapping draws samples of size n with replacement from the sample, which is also of size n. In a sense, bootstrapping conducts a sampling experiment to generate the sampling distribution, using the actual sample as if it were the population. For a variety of problems, including robust regression, bootstrapping provides a description of the sampling distribution and estimates of standard errors that apply even when standard parametric estimates, such as those from maximum likelihood theory, have not been developed. It is, in a sense, a non-parametric method for maximum likelihood estimation. Bootstrapping does not rely on assumptions about the population, such as normality of residuals, that can be very hard to justify in typical applications with macro-level social indicator data.

Kassab makes use of residual resampling, a common strategy for bootstrapping regression models. She first generates residuals using robust regression on the original sample. A sample of size n is drawn with replacement from these residuals. Next, bootstrap values of the dependent variable are generated by adding a residual from the bootstrap sample of residuals to a predicted value of the dependent variable for each observation. That is, for each case she generates

Y* = Ŷ + e*

where Y* is the bootstrap value of Y, Ŷ is the predicted value of Y for that observation, and e* is a residual assigned from the bootstrapped sample. A regression is estimated using the independent variables for the case to predict Y*. The resulting regression coefficients are stored. The process is repeated many times, each time with a new sample of the residuals from the original analysis used to generate a new set of Y* and then a new regression estimated using Y* and the original values of the independent variables. The distribution of estimated regression coefficients across the bootstrap samples can be treated as a sampling distribution. The standard deviation of this distribution can be considered a non-parametric estimate of the regression coefficient standard errors.
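The residual resampling steps just described can be sketched as below. To keep the example short we plug in OLS as the fitting routine, whereas Kassab uses a robust estimator at this point; any function that maps (X, y) to a coefficient vector could be passed as `fit`. The function names and the simulated data are ours.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients; stands in here for whatever estimator is being studied."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def residual_bootstrap(X, y, fit=ols, reps=250, seed=0):
    """Bootstrap standard errors by resampling residuals, holding X fixed."""
    rng = np.random.default_rng(seed)
    n = len(y)
    y_hat = X @ fit(X, y)                 # predicted values from the original fit
    resid = y - y_hat                     # residuals from the original fit
    boot = np.empty((reps, X.shape[1]))
    for r in range(reps):
        e_star = rng.choice(resid, size=n, replace=True)   # resampled residuals
        y_star = y_hat + e_star                            # Y* = Yhat + e*
        boot[r] = fit(X, y_star)                           # refit on original X
    return boot.std(axis=0, ddof=1), boot

# Simulated illustration
rng = np.random.default_rng(1)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
se_resid, boot_resid = residual_bootstrap(X, y)
print("bootstrap standard errors:", se_resid)
```

The rows of `boot_resid` are the stored coefficient vectors, so a histogram of any column gives a picture of the bootstrapped sampling distribution for that coefficient.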


A somewhat simpler procedure called data resampling can also be used. One draws a sample of size n, with replacement, from the observations (independent and dependent variables). This is a bootstrap sample. The regression is estimated for the bootstrap sample and the coefficients stored. A new bootstrap sample of size n is again drawn with replacement from the original data set, the regression estimated, and the coefficients stored. The process is repeated enough times to generate a bootstrapped sampling distribution. Again, the distribution and standard deviation of coefficients across the bootstrap samples provide a guide to sampling uncertainty.
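Data resampling differs from residual resampling only in what is drawn: whole cases (rows of both the independent and dependent variables) rather than residuals. A minimal sketch follows, again with OLS standing in for the estimator of interest and with hypothetical function names and simulated data.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients; stands in here for whatever estimator is being studied."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def data_bootstrap(X, y, fit=ols, reps=250, seed=0):
    """Bootstrap standard errors by resampling whole cases with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    boot = np.empty((reps, X.shape[1]))
    for r in range(reps):
        idx = rng.integers(0, n, size=n)     # case numbers drawn with replacement
        boot[r] = fit(X[idx], y[idx])        # refit on the resampled cases
    return boot.std(axis=0, ddof=1), boot

# Simulated illustration
rng = np.random.default_rng(1)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
se_data, boot_data = data_bootstrap(X, y)
print("bootstrap standard errors:", se_data)
```

Because each bootstrap sample carries its own set of independent-variable values, this version lets the carriers vary from sample to sample instead of treating them as fixed.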

The major conceptual difference between these two approaches is in the assumptions that each makes about the data. Residual resampling treats the independent variables as "fixed" or non-stochastic, as is common in experimental designs. The second approach assumes that the independent variables are themselves generated by a sampling process. Since macro-level social indicator data are not generated by design, we consider the second approach more realistic for most applications, although residual resampling is preferable with time-series or other data in which the residuals have a structure that should be preserved in the sampling process (Freedman and Peters 1987; Wu 1986). If there is no structure to the data, it appears that data resampling is more conservative (yields larger standard errors) than residual resampling (Aleong and Howard 1986; Efron 1982).

In general, the bootstrap provides a very flexible method of assessing sampling variability when analytic theory is either not fully developed or may require dubious assumptions, such as normality of the population residual distribution. Bootstrapped estimates are easily computed and provide not only standard errors, but also a picture of the sampling distribution. We believe that rural sociologists using macro-level social indicator data should generate bootstrapped estimates to complement standard errors based on normal theory, since normal theory assumptions are not realistic in this type of research.

An example

To illustrate these points, we estimate a simple macro-level comparative model using a number of robust techniques and provide bootstrapped estimates of the uncertainty of the resulting coefficients. Following Rudel (1989), we will examine the determinants of deforestation. Our dependent variable will be the rate of deforestation, measured as the average annual rate of decline in a country's forests during the period 1981-1985. This was calculated by dividing the average annual decline by the size of the closed forest area in 1980 (World Resources Institute 1986: Table 6.1). To aid exposition, we use a simple model, albeit one that has been much debated in the literature on global change (Ehrlich and Holdren 1974; Ehrlich and Ehrlich 1990; U.S. National Research Council Committee on Human Dimensions of Global Change 1991). We specify that environmental impact will be the result of population pressure multiplied by economic activity. Gross national product per capita in 1980 is our measure of economic activity (World Bank 1989: Table 1). Our measure of population is 1980 population (World Bank 1983: Social Data Sheet 1) divided by size of the closed forest area in 1980 (World Resources Institute 1986: Table 6.1). This density measure can be interpreted as population pressure. We have linearized the multiplicative model by taking the natural logarithms of both dependent and independent variables:

LDEFOR = B0 + (B1·LDEN) + (B2·LGNP) + e

where

LDEFOR is the logarithm of the deforestation rate;
LDEN is the logarithm of density;
LGNP is the logarithm of gross national product per capita;
B0, B1, B2 are the regression coefficients; and
e is the residual.
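The linearization step can be written out as follows. The power form of the multiplicative model, and the symbols a and ε for its constant and multiplicative error, are our assumptions for illustration; the text states only that the model is multiplicative and that logarithms were taken.

```latex
\begin{aligned}
\mathrm{DEFOR} &= a \cdot \mathrm{DEN}^{B_1} \cdot \mathrm{GNP}^{B_2} \cdot \varepsilon \\
\ln(\mathrm{DEFOR}) &= \ln a + B_1 \ln(\mathrm{DEN}) + B_2 \ln(\mathrm{GNP}) + \ln \varepsilon \\
\mathrm{LDEFOR} &= B_0 + B_1\,\mathrm{LDEN} + B_2\,\mathrm{LGNP} + e,
\qquad B_0 = \ln a,\quad e = \ln \varepsilon
\end{aligned}
```

Under this form, B1 and B2 can be read as elasticities: the proportional change in the deforestation rate associated with a proportional change in density or in gross national product per capita.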

The sample is 35 developing countries (listed in Appendix A) with substantial rain forests for which data were available.

Table 1 presents the results of four estimation strategies. OLS is ordinary least squares, the traditional strategy; LMS is least median of squares, which is highly robust but inefficient; RLS is Rousseeuw and Leroy's suggested strategy of reweighting OLS based on the LMS residuals; and GM is Kassab's estimator based on DFFITS.² OLS results indicate that population density has a significant effect on the deforestation rate, while economic activity does not. The GM estimate gives similar results, but the coefficient for LGNP is roughly three times as large as for OLS. GM strongly downweights Gabon, Trinidad and Tobago, and Papua New Guinea, and moderately downweights Guinea-Bissau and India. LMS gives much more weight to the effect of economic activity. The RLS estimates based on LMS suggest that both density and economic activity have significant effects on the deforestation rate, with deforestation increasing as both density and gross national product per capita increase. RLS, like GM, downweights Gabon, Guinea-Bissau, and Trinidad and Tobago. RLS also downweights Venezuela and Nepal, while it does not find India or Papua New Guinea unusually influential.

² M and GM estimators, as well as some other robust techniques, use "tuning constants" to determine how stringent they will be in rejecting outliers. We have used a tuning constant of 1 for Kassab's estimator. We also experimented with constants of 1.5 and 2 as suggested by Kassab and by Sheather and Hettmansperger (1987). These produced results quite similar to those reported here.


Table 1. Alternative robust estimates of deforestation model

               OLS        LMS        RLS        GME
Intercept    -5.341*   -14.114    -11.194*    -6.106*
             (1.612)               (1.939)    (1.652)
LDEN          0.294*     0.396      0.442*     0.289*
             (0.125)               (0.120)    (0.123)
LGNP          0.067      1.408      0.967*     0.196
             (0.239)               (0.292)    (0.246)
R-square      0.182      0.466      0.363      0.158

Cases downweighted:
  OLS   None
  LMS   na
  RLS   Weight 0: 4, 7, 27, 28, 31
  GME   Weight 0.75 to 0.50: 7, 29
        Weight less than 0.50: 4, 27, 32

* t > 1.96.
Note: OLS is ordinary least squares; LMS is least median squares; RLS is reweighted least squares based on LMS; and GME is bounded influence regression. Standard errors are in parentheses.

Table 2 and Figure 1 present the results of bootstrapping OLS, RLS, and GM. We used data resampling and, for the robust estimators, bootstrapped only the last step in the iterative estimation procedure.³ The results are based on 250 bootstrap replications. A comparison of bootstrapped standard errors with those provided by asymptotic theory suggests that for OLS and RLS the asymptotic theory standard errors are too optimistic, while for GM the asymptotic standard errors are too pessimistic. Figure 1 displays the sampling distribution for each coefficient under each estimation method. The significance of LDEN is obvious: only rarely did bootstrapped samples produce coefficients near zero. The difference between RLS and either OLS or GM in estimating the effect of LGNP also is clear. For RLS very few bootstrap estimates fall near zero, while a large number of OLS and GM estimates do.
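The comparison of standard errors can be checked directly from the values reported in Tables 1 and 2. The sketch below assumes that the three parenthesized standard errors in each row of Table 1 belong to OLS, RLS, and GME in that order (LMS reports none); the dictionary layout and variable names are ours.

```python
# Standard errors as reported in Table 1 (asymptotic) and Table 2 (bootstrap).
# Column assignment for Table 1 is an assumption: OLS, RLS, GME in order.
asymptotic = {
    "OLS": {"Intercept": 1.612, "LDEN": 0.125, "LGNP": 0.239},
    "RLS": {"Intercept": 1.939, "LDEN": 0.120, "LGNP": 0.292},
    "GME": {"Intercept": 1.652, "LDEN": 0.123, "LGNP": 0.246},
}
bootstrap = {
    "OLS": {"Intercept": 2.003, "LDEN": 0.131, "LGNP": 0.297},
    "RLS": {"Intercept": 2.202, "LDEN": 0.134, "LGNP": 0.334},
    "GME": {"Intercept": 1.571, "LDEN": 0.113, "LGNP": 0.103},
}

# Ratio above 1: bootstrap standard error is larger (asymptotic value optimistic).
# Ratio below 1: bootstrap standard error is smaller (asymptotic value pessimistic).
ratios = {
    method: {term: bootstrap[method][term] / asymptotic[method][term]
             for term in asymptotic[method]}
    for method in asymptotic
}
for method, by_term in ratios.items():
    print(method, {term: round(value, 2) for term, value in by_term.items()})
```

Every ratio is above 1 for OLS and RLS and below 1 for GME, which is the pattern described in the text.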

The value of both robust estimation and bootstrapping comes from their ability to reveal patterns that might otherwise be missed. For example, Gabon, Venezuela, and Trinidad and Tobago have relatively high per capita gross national products because of their petroleum exports, but have only moderate deforestation rates; thus, their presence in the sample reduces the estimated effect of LGNP. Guinea-Bissau and Nepal have very high deforestation rates, but are quite poor; so again, their presence in the sample depresses the estimated effect of economic activity on deforestation.

³ An alternative would be to replicate the entire robust estimation strategy (LMS followed by RLS, and OLS followed by GM). Indeed, one advantage of bootstrapping is that it can provide information about uncertainty for research strategies that involve iterative processes and decision criteria (Efron and Tibshirani 1985). We have limited ourselves to a simpler approach in the interests of clarity.

[Figure 1. Bootstrapping results: bootstrapped sampling distributions of the LGNP and LDEN coefficients under OLS, RLS, and GME. Horizontal axis: value of coefficient.]

Both GM and LMS identified a similar set of countries as outliers. GM did not identify either Venezuela or Nepal as highly deviant from the rest of the data, while LMS did. GM treated India and Papua New Guinea as outliers, while LMS did not. But both point to the problem of a few discrepant observations, and LMS downweights those data points sufficiently to suggest a different conclusion than would be reached from the GM-estimates about the influence of economic activity on deforestation. The differences emerge because, in a statistical sense, LMS is more robust than GM. This does not necessarily make it superior. For some problems, LMS might be too robust, leading to downweighting of data points that are telling an important part of the story. That is why we emphasize that robust estimators, and especially LMS, must be used as guides in analysis. It may be that the unusual data points identified by LMS can provide important theoretical insights. Indeed, a major emphasis of Kassab's (1990) article was to develop a model of heterogeneity in the population of communities, using interaction terms to capture that heterogeneity. Robust procedures help to identify such heterogeneity, and thus can suggest modifications to theory and to models.

Table 2. Bootstrapped regressions of deforestation model (based on 250 replications)

               OLS        RLS        GME
Intercept    -5.595*   -11.188*    -6.054*
             (2.003)    (2.202)    (1.571)
LDEN          0.296*     0.436*     0.279*
             (0.131)    (0.134)    (0.113)
LGNP          0.108      0.971*     0.192
             (0.297)    (0.334)    (0.103)

* t > 1.96.
Note: OLS is ordinary least squares; LMS is least median squares; RLS is reweighted least squares based on LMS; and GME is Kassab's bounded influence regression. Standard errors are in parentheses.

Conclusions

In addition to her substantive results, Kassab makes an important methodological contribution by pointing out the importance of robust and resampling regression procedures when analyzing macro-level social indicator data. Here we have expanded on her discussion, explicating some issues regarding the logic of robust regression and suggesting some alternative approaches to both robust regression and bootstrapping. Since macro-level social indicator data are commonly used in rural sociology, these methods are relevant in many contexts. We urge their increased use because they are much more powerful than OLS regression and provide both conceptual logic and diagnostic tools to help rural sociologists think more carefully about the data and models under consideration.

References

Aleong, John, and Diantha Howard
1986 "Bootstrapping regression equations." Pp. 287-92 in American Statistical Association (ed.), 1986 Proceedings of the Statistical Computing Section. Washington, DC: American Statistical Association.

Atkinson, A. C.
1986 "Masking unmasked." Biometrika 73:533-41.

Belsley, David A., Edwin Kuh, and Roy E. Welsch
1980 Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Bollen, Kenneth A., and Robert W. Jackman
1985 "Regression diagnostics: an expository treatment of outliers and influential cases." Sociological Methods and Research 13:510-42.

Carson, Richard T.
1985 "SAS macros for bootstrapping and cross-validating regression equations." Pp. 1059-63 in SUGI (ed.), Proceedings of the Tenth Annual SAS Users Group International Conference. Cary, NC: SAS Institute.

Dallal, Gerard E.
1991 "LMS: least median of squares regression." The American Statistician 45:74.

Dietz, Thomas, R. Scott Frey, and Linda Kalof
1987 "Estimation with cross-national data: robust and nonparametric methods." American Sociological Review 52:380-90.

Efron, Bradley
1982 The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia: Society for Industrial and Applied Mathematics.

Efron, Bradley, and Robert Tibshirani
1985 The Bootstrap Method for Assessing Statistical Accuracy. Department of Statistics, Technical Report No. 101. Stanford, CA: Stanford University.


Ehrlich, Paul R., and Anne H. Ehrlich
1990 The Population Explosion. New York: Simon and Schuster.

Ehrlich, Paul R., and John P. Holdren
1974 "Impact of population growth." Science 171:1212-17.

Freedman, David A., and Stephen C. Peters
1987 "Bootstrapping a regression equation: some empirical results." Journal of the American Statistical Association 79:97-106.

Hamilton, Lawrence C.
1991 "How robust is robust regression?" STATA Technical Bulletin 2:21-26.

Hampel, Frank R., Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel
1986 Robust Statistics: The Approach Based on Influence Functions. New York: Wiley.

Kassab, Cathy
1990 "Studying economic change: are robust regression procedures needed?" Rural Sociology 55(3):357-75.

Levy, David
1984 Robust Regression Under Conditions of Normality and Frequent Autocorrelation: Contamination by the Observing Statistician. Fairfax, Virginia: Center for the Study of Public Choice, George Mason University.

Li, Guoying
1985 "Robust regression." Pp. 281-343 in David C. Hoaglin, Frederick Mosteller, and John W. Tukey (eds.), Exploring Data Tables, Trends and Shapes. New York: Wiley.

Rousseeuw, Peter J.
1984 "Least median of squares regression." Journal of the American Statistical Association 79:871-80.

Rousseeuw, Peter J., and Annick M. Leroy
1987 Robust Regression and Outlier Detection. New York: Wiley.

Royston, J. P.
1983 "A simple method for evaluating the Shapiro-Francia W' test of nonnormality." Statistician 32:297-300.
1991 "Tests for departure from normality." STATA Technical Bulletin 2:16-17.

Rudel, Thomas K.
1989 "Population, development and tropical deforestation: a cross-national study." Rural Sociology 54:327-38.

Sheather, S. J., and T. P. Hettmansperger
1987 Estimating the Standard Error of Robust Regression Estimates. Department of Statistics, Technical Reports and Preprints, No. 70. University Park: The Pennsylvania State University.

Tibshirani, Robert J.
1985 "Bootstrap computations." Pp. 1059-63 in SUGI (ed.), Proceedings of the Tenth Annual SAS Users Group International Conference. Cary, NC: SAS Institute.

U.S. National Research Council Committee on Human Dimensions of Global Change
1991 Global Environmental Change: The Human Dimensions. Washington, DC: National Academy Press.

World Bank
1983 World Tables, Volume II, Third Edition. Baltimore: Johns Hopkins University Press.
1989 World Tables, 1988-1989 Edition. Baltimore: The Johns Hopkins University Press.

World Resources Institute
1986 World Resources, 1986. New York: Basic.

Wu, C. F.
1986 "The jackknife, bootstrap and other resampling plans for regression." Annals of Statistics 14:1261-1350.


Wu, Lawrence
1985 "Robust M-estimation of location and regression." Pp. 316-88 in Nancy Brandon Tuma (ed.), Sociological Methodology 1985. San Francisco: Jossey-Bass.

Zellner, Arnold
1983 "Statistical theory and econometrics." Pp. 67-178 in Zvi Griliches and Michael D. Intriligator (eds.), Handbook of Econometrics, Vol. I. Amsterdam: North-Holland.

Appendix 1

Countries used in the analysis

1. Benin
2. Burundi
3. Cameroon
4. Gabon
5. Gambia
6. Ghana
7. Guinea-Bissau
8. Ivory Coast
9. Kenya
10. Liberia
11. Rwanda
12. Sierra Leone
13. Tanzania
14. Togo
15. Zaire
16. Bolivia
17. Brazil
18. Colombia
19. Costa Rica
20. Ecuador
21. Haiti
22. Jamaica
23. Mexico
24. Nicaragua
25. Paraguay
26. Peru
27. Trinidad and Tobago
28. Venezuela
29. India
30. Malaysia
31. Nepal
32. Papua New Guinea
33. Philippines
34. Sri Lanka
35. Thailand


Appendix 2

Software for robust and resampling estimation

A few years ago, we noted that it was difficult to use robust and resampling procedures in most common statistical packages (Dietz et al. 1987). The situation has improved considerably since then. Here we will mention the options of which we are aware.

Robust estimation.

Since most M and GM estimators are easily calculated using weighted least squares, any package that produces residual and influence diagnostics can be used for M and GM estimation. Even when the package does not allow use of diagnostics in an iterative re-weighting procedure, a one-step estimator can easily be calculated using the data weighting procedures available in most packages. Diagnostics from an initial OLS estimate are used to calculate the weights, which are then applied in a single weighted least squares fit. If the procedure to be used is highly robust, such one-step estimators are nearly as good as fully iterated estimators, but one-step estimators may not function well with less robust procedures (Hampel et al. 1986:106). SAS includes PROC LAV, which produces a least absolute deviations estimate. LMS is available in a program called PROGRESS, described in Rousseeuw and Leroy (1987). Dallal (1991) has written a program based on PROGRESS that eases the rather stringent upper limits PROGRESS places on the number of independent variables and observations to be used. The S-PLUS package also includes LMS regression. STATA includes a carefully designed robust regression routine that produces M-estimates by beginning with OLS, then using the well-known Huber estimator (Hampel et al. 1986:311-14) followed by biweighted least squares. STATA also includes a GM regression estimator, and it appears that new robust estimators will be added in the near future (Hamilton 1991). Programming robust estimators by reweighted least squares is very simple in STATA.
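The one-step estimator described above can be sketched in a few lines. The following is a minimal illustration only, written in Python rather than in any of the packages named here, and it is not the PROGRESS or STATA code used for this paper. It assumes a single predictor, Huber weights with the conventional tuning constant 1.345, and a residual scale based on the median absolute deviation; the function names and the small data set are hypothetical.

```python
import statistics


def wls_line(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return ybar - b * xbar, b


def huber_weights(resid, c=1.345):
    """Huber weights: 1 inside the cutoff, c/|u| outside (assumed tuning constant)."""
    med = statistics.median(resid)
    mad = statistics.median([abs(r - med) for r in resid])
    scale = mad / 0.6745  # MAD rescaled to estimate sigma under normality
    if scale == 0:
        return [1.0] * len(resid)
    return [1.0 if abs(r / scale) <= c else c / abs(r / scale) for r in resid]


def one_step_huber(x, y):
    """OLS start, weights from the OLS residuals, then one weighted refit."""
    ols = wls_line(x, y, [1.0] * len(x))
    resid = [yi - (ols[0] + ols[1] * xi) for xi, yi in zip(x, y)]
    w = huber_weights(resid)
    return ols, wls_line(x, y, w), w


# Hypothetical data: an exact line y = 2 + 0.5x with one contaminated case.
x = list(range(10))
y = [2 + 0.5 * xi for xi in x]
y[9] += 10
ols, one_step, weights = one_step_huber(x, y)
print("OLS (a, b):", ols)
print("one-step (a, b):", one_step)
```

Because the weights come from OLS residuals, a high-leverage outlier can pull the starting fit toward itself and escape heavy downweighting, which is one reason a more robust starting estimate such as LMS may be preferred.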

Bootstrapping

Efron and Tibshirani (1985) provide a general algorithm for bootstrapping. Tibshirani (1985) and Carson (1985) provide the SAS code for bootstrapping. The SPSS-X CNLR procedure will bootstrap, though the exact method used is not documented and it appears that only limited output is available from the bootstrapping procedure. The EQS structural equations modeling program provides a bootstrapping option. STATA has a built-in procedure that allows any STATA statistical procedure or combination of procedures to be bootstrapped. This makes it relatively easy to use the bootstrap to assess the uncertainty associated with analysis strategies that involve several steps, types of computations, and decision rules rather than being restricted to bootstrapping a single estimator. The computations for this paper were conducted using PROGRESS and STATA.
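To illustrate what it means to bootstrap a whole analysis strategy rather than a single estimator, the sketch below resamples cases with replacement and reruns a two-step procedure on each resample. It is an illustration in Python under stated assumptions, not the STATA procedure used for this paper: the two-step rule (fit OLS, drop the worst-fitting case, refit), the function names, and the data are all hypothetical, and resampling cases is only one of several bootstrap schemes.

```python
import random
import statistics


def fit_line(rows):
    """OLS fit of y = a + b*x from (x, y) rows; returns (a, b)."""
    n = len(rows)
    xbar = sum(x for x, _ in rows) / n
    ybar = sum(y for _, y in rows) / n
    sxx = sum((x - xbar) ** 2 for x, _ in rows)
    b = sum((x - xbar) * (y - ybar) for x, y in rows) / sxx
    return ybar - b * xbar, b


def trimmed_slope(rows):
    """Hypothetical two-step strategy: OLS, drop the worst-fitting case, refit."""
    a, b = fit_line(rows)
    worst = max(range(len(rows)),
                key=lambda i: abs(rows[i][1] - a - b * rows[i][0]))
    kept = rows[:worst] + rows[worst + 1:]
    return fit_line(kept)[1]


def bootstrap_se(rows, analysis, reps=500, seed=0):
    """Standard deviation of `analysis` over resamples of cases with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    while len(estimates) < reps:
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        try:
            estimates.append(analysis(sample))
        except ZeroDivisionError:
            continue  # degenerate resample with no variation in x; draw again
    return statistics.stdev(estimates)


# Hypothetical data: a line with a small alternating disturbance and one outlier.
rows = [(x, 2 + 0.5 * x + 0.3 * ((-1) ** x)) for x in range(12)]
rows[11] = (11, rows[11][1] + 10)
se = bootstrap_se(rows, trimmed_slope, reps=200, seed=1)
print("bootstrap SE of the two-step slope:", se)
```

Passing the entire strategy as the `analysis` argument means the decision rule is re-applied within every resample, so the resulting standard error reflects the uncertainty introduced by that rule as well as by the final fit.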