
Missing Data: Our View of the State of the Art

Joseph L. Schafer and John W. Graham
Pennsylvania State University

Statistical procedures for missing data have vastly improved, yet misconception and unsound practice still abound. The authors frame the missing-data problem, review methods, offer advice, and raise issues that remain unresolved. They clear up common misunderstandings regarding the missing at random (MAR) concept. They summarize the evidence against older procedures and, with few exceptions, discourage their use. They present, in both technical and practical language, 2 general approaches that come highly recommended: maximum likelihood (ML) and Bayesian multiple imputation (MI). Newer developments are discussed, including some for dealing with missing data that are not MAR. Although not yet in the mainstream, these procedures may eventually extend the ML and MI methods that currently represent the state of the art.

Why do missing data create such difficulty in scientific research? Because most data analysis procedures were not designed for them. Missingness is usually a nuisance, not the main focus of inquiry, but handling it in a principled manner raises conceptual difficulties and computational challenges. Lacking resources or even a theoretical framework, researchers, methodologists, and software developers resort to editing the data to lend an appearance of completeness. Unfortunately, ad hoc edits may do more harm than good, producing answers that are biased, inefficient (lacking in power), and unreliable.

Purposes of This Article

This article’s intended audience and purposes are varied. For the novice, we review the available methods and describe their strengths and limitations. For those already familiar with missing-data issues, we seek to fill gaps in understanding and highlight recent developments in this rapidly changing field. For methodologists, we hope to stimulate thoughtful discussion and point to important areas for new research. One of our main reasons for writing this article is to familiarize researchers with these newer techniques and encourage them to apply these methods in their own work.

In the remainder of this article, we describe criteria by which missing-data procedures should be evaluated. Fundamental concepts, such as the distribution of missingness and the notion of missing at random (MAR), are presented in nontechnical fashion. Older procedures, including case deletion and single imputation, are reviewed and assessed. We then review and compare modern procedures of maximum likelihood (ML) and multiple imputation (MI), describing their strengths and limitations. Finally, we describe new techniques that attempt to relax distributional assumptions and methods that do not assume that missing data are MAR.

In general, we emphasize and recommend two approaches. The first is ML estimation based on all available data; the second is Bayesian MI. Readers who are not yet familiar with these techniques may wish to see step-by-step illustrations of how to apply them to real data. Such examples have already been published, and space limitations do not allow us to repeat them here. Rather, we focus on the underlying motivation and principles and provide references so that interested readers may learn the specifics of applying them later. Many software products (both free and commercial) that implement ML and MI are listed here, but in this rapidly changing field others will undoubtedly become available soon. A resource Web page maintained by John W. Graham (http://methodology.psu.edu/resources.html) will provide timely updates on missing-data applications and utilities as they evolve, along with step-by-step instructions on MI with NORM (Schafer, 1999b).

Joseph L. Schafer, Department of Statistics and the Methodology Center, Pennsylvania State University; John W. Graham, Department of Biobehavioral Health, Pennsylvania State University.

This research was supported by National Institute on Drug Abuse Grant 1-P50-DA10075.

Correspondence concerning this article should be addressed to Joseph L. Schafer, The Methodology Center, Pennsylvania State University, Henderson S-159, University Park, Pennsylvania 16802. E-mail: [email protected]

Psychological Methods, 2002, Vol. 7, No. 2, 147–177. Copyright 2002 by the American Psychological Association, Inc. 1082-989X/02/$5.00 DOI: 10.1037//1082-989X.7.2.147

Fundamentals

What Is a Missing Value?

Data contain various codes to indicate lack of response: “Don’t know,” “Refused,” “Unintelligible,” and so on. Before applying a missing-data procedure, one should consider whether an underlying “true” value exists and, if so, whether that value is unknown. The answers may not be obvious. Consider a questionnaire for adolescents with a section on marijuana use. The first item is “Have you ever tried marijuana?” If the response is “No,” the participant is directed to skip items on recent and lifetime use and proceed to the next section. Many researchers would not consider the skipped items to be missing, because never having used marijuana logically implies no recent or lifetime use. In the presence of response error, however, it might be a mistake to presume that all skipped items are zero, because some answers to the initial question may be incorrect.

Interesting issues arise in longitudinal studies in which unfortunate events preclude measurement. If participants die, should we consider their characteristics (e.g., mental functioning) at subsequent occasions to be missing? Some might balk at the idea, but in some contexts it is quite reasonable. If deaths are rare and largely unrelated to the phenomena of interest, then we may want to estimate parameters for an ideal scenario in which no one dies during the study. At other times, we may want to describe the characteristics of live participants only and perhaps perform additional analyses with mortality itself as the outcome. Even then, however, it is sometimes convenient to posit the existence of missing values for the deceased purely as a computational device, to permit the use of missing-data algorithms. These issues are clarified later, after we review the notion of MAR.

Missing values are part of the more general concept of coarsened data, which includes numbers that have been grouped, aggregated, rounded, censored, or truncated, resulting in partial loss of information (Heitjan & Rubin, 1991). Latent variables, a concept familiar to psychologists, are also closely related to missing data. Latent variables are unobservable quantities (e.g., intelligence, assertiveness) that are only imperfectly measured by test or questionnaire items. Computational methods for missing data may simplify parameter estimation in latent-variable models; a good example is the expectation-maximization (EM) algorithm for latent class analysis (Clogg & Goodman, 1984).

Psychologists have sometimes made a distinction between missing values on independent variables (predictors) and missing values on dependent variables (outcomes). From our perspective, these two do not fundamentally differ. It is true that, under certain assumptions, missing values on a dependent variable may be efficiently handled by a very simple method such as case deletion, whereas good missing-data procedures for independent variables can be more difficult to implement. We discuss these matters later as we review specific classes of missing-data procedures. However, we caution our readers not to believe general statements such as, “Missing values on a dependent variable can be safely ignored,” because such statements are imprecise and generally false.

Historical Development

Until the 1970s, missing values were handled primarily by editing. Rubin (1976) developed a framework of inference from incomplete data that remains in use today. The formulation of the EM algorithm (Dempster, Laird, & Rubin, 1977) made it feasible to compute ML estimates in many missing-data problems. Rather than deleting or filling in incomplete cases, ML treats the missing data as random variables to be removed from (i.e., integrated out of) the likelihood function as if they were never sampled. We elaborate on this point later after introducing the notion of MAR. Many examples of EM were described by Little and Rubin (1987). Their book also documented the shortcomings of case deletion and single imputation, arguing for explicit models over informal procedures. About the same time, Rubin (1987) introduced the idea of MI, in which each missing value is replaced with m > 1 simulated values prior to analysis. Creation of MIs was facilitated by computer technology and new methods for Bayesian simulation discovered in the late 1980s (Schafer, 1997). ML and MI are now becoming standard because of implementations in free and commercial software.

The 1990s have seen many new developments. Reweighting, long used by survey methodologists, has been proposed for handling missing values in regression models with missing covariates (Ibrahim, 1990). New lines of research focus on how to handle missing values while avoiding the specification of a full parametric model for the population (Robins, Rotnitzky, & Zhao, 1994). New methods for nonignorable modeling, in which the probabilities of nonresponse are allowed to depend on the missing values themselves, are proliferating in biostatistics and public health. The primary focus of these nonignorable models is dropout in clinical trials, in which participants may be leaving the study for reasons closely related to the outcomes being measured (Little, 1995). Researchers are now beginning to assess the sensitivity of results to alternative hypotheses about the distribution of missingness (Verbeke & Molenberghs, 2000).

Goals and Criteria

With or without missing data, the goal of a statistical procedure should be to make valid and efficient inferences about a population of interest—not to estimate, predict, or recover missing observations nor to obtain the same results that we would have seen with complete data. Attempts to recover missing values may impair inference. For example, the common practice of mean substitution—replacing each missing value for a variable with the average of the observed values—may accurately predict missing data but distort estimated variances and correlations. A missing-value treatment cannot be properly evaluated apart from the modeling, estimation, or testing procedure in which it is embedded.

Basic criteria for evaluating statistical procedures have been established by Neyman and Pearson (1933) and Neyman (1937). Let Q denote a generic population quantity to be estimated, and let Q̂ denote an estimate of Q based on a sample of data. If the sample contains missing values, then the method for handling them should be considered part of the overall procedure for calculating Q̂. If the procedure works well, then Q̂ will be close to Q, both on average over repeated samples and for any particular sample. That is, we want the bias—the difference between the average value of Q̂ and the true Q—to be small, and we also want the variance or standard deviation of Q̂ to be small. Bias and variance are often combined into a single measure called mean square error, which is the average value of the squared distance (Q̂ − Q)² over repeated samples. The mean square error is equal to the squared bias plus the variance.
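In symbols, the decomposition reads (a standard identity, restated here in LaTeX notation rather than the article’s typeset math):

\mathrm{MSE}(\hat{Q}) \;=\; E\big[(\hat{Q}-Q)^2\big] \;=\; \big(E[\hat{Q}]-Q\big)^2 + \mathrm{Var}(\hat{Q}) \;=\; \mathrm{Bias}^2 + \mathrm{Variance}.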

Bias, variance, and mean square error describe the behavior of an estimate, but we also want honesty in the measures of uncertainty that we report. A reported standard error SE(Q̂) should be close to the true standard deviation of Q̂. A procedure for confidence intervals—for example, Q̂ ± 2 SE(Q̂) for a 95% interval—should cover the true Q with probability close to the nominal rate. If the coverage rate is accurate, the probability of Type I error (wrongly rejecting a true null hypothesis) will also be accurate. Subject to correct coverage, we also want the intervals to be narrow, because shorter intervals will reduce the rate of Type II error (failure to accept a true alternative hypothesis) and increase power.

When missing values occur for reasons beyond our control, we must make assumptions about the processes that create them. These assumptions are usually untestable. Good science suggests that assumptions be made explicit and the sensitivity of results to departures be investigated. One hopes that similar conclusions will follow from a variety of realistic alternative assumptions; when that does not happen, the sensitivity should be reported.

Finally, one should avoid tricks that apparently solve the missing-data problem but actually redefine the parameters or the population. For example, consider a linear regression of Y on X, where the predictor X is sometimes missing. Suppose we replace the missing values by an arbitrary number (say, zero) and introduce a dummy indicator Z that is one if X is missing and zero if X is observed. This procedure merely redefines the coefficients. In the original model E(Y) = β0 + β1X, where E represents expected value, β0 and β1 represent the intercept and slope for the full population; in the expanded model E(Y) = β0 + β1X + β2Z, β0 and β1 represent the intercept and slope for respondents, and β0 + β2 represents the mean of Y among nonrespondents. For another example, suppose that missing values occur on a nominal outcome with response categories 1, 2, . . . , k. One could treat the missing value as category k + 1, but that merely redefines categories 1, . . . , k to apply only to respondents.
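To make the redefinition concrete, the following small simulation is our own sketch, not from the article; all names and numbers are illustrative, and only numpy is assumed:

import numpy as np

# Sketch: the missing-indicator trick when the predictor X is missing
# whenever X > 140 (an MAR-on-X mechanism chosen for illustration).
rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(125, 25, n)                  # predictor, fully known here
y = 10 + 0.6 * x + rng.normal(0, 15, n)     # true model: E(Y) = 10 + 0.6 X

miss = x > 140                              # which X values are "missing"
x_filled = np.where(miss, 0.0, x)           # fill missing X with arbitrary zero
z = miss.astype(float)                      # dummy indicator: 1 if X missing

# Least-squares fit of the expanded model Y ~ 1 + X_filled + Z.
design = np.column_stack([np.ones(n), x_filled, z])
b0, b1, b2 = np.linalg.lstsq(design, y, rcond=None)[0]
print(b0, b1)    # intercept and slope among respondents only
print(b0 + b2)   # mean of Y among nonrespondents, not a population slope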

Types and Patterns of Nonresponse

Survey methodologists have historically distinguished unit nonresponse, which occurs when the entire data collection procedure fails (because the sampled person is not at home, refuses to participate, etc.), from item nonresponse, which means that partial data are available (i.e., the person participates but does not respond to certain individual items). Survey statisticians have traditionally handled unit nonresponse by reweighting and item nonresponse by single imputation. These older methods, which are reviewed in this article, may perform reasonably well in some situations, but more modern procedures (e.g., ML and MI) exhibit good behavior more generally.

In longitudinal studies, participants may be present for some waves of data collection and missing for others. This kind of missingness may be called wave nonresponse. Attrition, or dropout, which is a special case of wave nonresponse, occurs when one leaves the study and does not return. Overall, dropout or attrition may be the most common type of wave nonresponse. However, it is not uncommon for participants to be absent from one wave and subsequently reappear. Because repeated measurements on an individual tend to be correlated, we recommend procedures that use all the available data for each participant, because missing information can then be partially recovered from earlier or later waves. Longitudinal modeling by ML can be a highly efficient way to use the available data. MI of missing responses is also effective if we impute under a longitudinal model that borrows information across waves.

Many data sets can be arranged in a rectangular or matrix form, where the rows correspond to observational units or participants and the columns correspond to items or variables. With rectangular data, there are several important classes of overall missing-data patterns. Consider Figure 1a, in which missing values occur on an item Y but a set of p other items X1, . . . , Xp is completely observed; we call this a univariate pattern. The univariate pattern is also meant to include situations in which Y represents a group of items that is either entirely observed or entirely missing for each unit. In Figure 1b, items or item groups Y1, . . . , Yp may be ordered in such a way that if Yj is missing for a unit, then Yj+1, . . . , Yp are missing as well; this is called a monotone pattern.

Monotone patterns may arise in longitudinal studies with attrition, with Yj representing variables collected at the jth occasion. Figure 1c shows an arbitrary pattern in which any set of variables may be missing for any unit.

The Distribution of Missingness

For any data set, one can define indicator variables R that identify what is known and what is missing. We refer to R as the missingness. The form of the missingness depends on the complexity of the pattern. In Figure 1a, R can be a single binary item for each unit indicating whether Y is observed (R = 1) or missing (R = 0). In Figure 1b, R can be an integer variable (1, 2, . . . , p) indicating the highest j for which Yj is observed. In Figure 1c, R can be a matrix of binary indicators of the same dimension as the data matrix, with elements of R set to 1 or 0 according to whether the corresponding data values are observed or missing.
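As a concrete sketch (ours, not the authors’; it assumes numpy and uses NaN to mark missing entries), the three forms of R can be constructed as follows:

import numpy as np

data = np.array([[1.0, 2.0, np.nan],
                 [4.0, np.nan, np.nan],
                 [7.0, 8.0, 9.0]])
observed = ~np.isnan(data)

# Figure 1c (arbitrary pattern): a binary matrix the same shape as the data.
R_matrix = observed.astype(int)

# Figure 1a (univariate pattern): one 0/1 indicator per unit for the item Y,
# here taken to be the last column.
R_binary = observed[:, -1].astype(int)

# Figure 1b (monotone pattern): the highest j such that Y1, ..., Yj are all
# observed (this assumes the columns are ordered to make the pattern monotone).
R_integer = observed.cumprod(axis=1).sum(axis=1)

print(R_matrix, R_binary, R_integer, sep="\n")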

In modern missing-data procedures missingness is regarded as a probabilistic phenomenon (Rubin, 1976). We treat R as a set of random variables having a joint probability distribution. We may not have to specify a particular distribution for R, but we must agree that it has a distribution. In statistical literature, the distribution of R is sometimes called the response mechanism or missingness mechanism, which may be confusing because mechanism suggests a real-world process by which some data are recorded and others are missed. To describe accurately all potential causes or reasons for missingness is not realistic. The distribution of R is best regarded as a mathematical device to describe the rates and patterns of missing values and to capture roughly possible relationships between the missingness and the values of the missing items themselves. To avoid suggestions of causality, we therefore refer to the probability distribution for R as the distribution of missingness or the probabilities of missingness.

Figure 1. Patterns of nonresponse in rectangular data sets: (a) univariate pattern, (b) monotone pattern, and (c) arbitrary pattern. In each case, rows correspond to observational units and columns correspond to variables.

Missing at Random

Because missingness may be related to the data, we classify distributions for R according to the nature of that relationship. Rubin (1976) developed a typology for these distributions that is widely cited but less widely understood. Adopting a generic notation, let us denote the complete data as Ycom and partition it as Ycom = (Yobs, Ymis), where Yobs and Ymis are the observed and missing parts, respectively. Rubin (1976) defined missing data to be MAR if the distribution of missingness does not depend on Ymis,

P(R | Ycom) = P(R | Yobs). (1)

In other words, MAR allows the probabilities of missingness to depend on observed data but not on missing data.1 An important special case of MAR, called missing completely at random (MCAR), occurs when the distribution does not depend on Yobs either,

P(R | Ycom) = P(R).

When Equation 1 is violated and the distribution depends on Ymis, the missing data are said to be missing not at random (MNAR). MAR is also called ignorable nonresponse, and MNAR is called nonignorable.

For intuition, it helps to relate these definitions to the patterns in Figure 1. Consider the univariate pattern of Figure 1a, where variables X = (X1, . . . , Xp) are known for all participants but Y is missing for some. If participants are independently sampled from the population, then MCAR, MAR, and MNAR have simple interpretations in terms of X and Y: MCAR means that the probability that Y is missing for a participant does not depend on his or her own values of X or Y (and, by independence, does not depend on the X or Y of other participants either), MAR means that the probability that Y is missing may depend on X but not Y, and MNAR means that the probability of missingness depends on Y. Notice that under MAR, there could be a relationship between missingness and Y induced by their mutual relationships to X, but there must be no residual relationship between them once X is taken into account. Under MNAR, some residual dependence between missingness and Y remains after accounting for X.

Notice that Rubin’s (1976) definitions describe statistical relationships between the data and the missingness, not causal relationships. Because we often consider real-world reasons why data become missing, let us imagine that one could code all the myriad reasons for missingness into a set of variables. This set might include variables that explain why some participants were physically unable to show up (age, health status), variables that explain the tendency to say “I don’t know” or “I’m not sure” (cognitive functioning), variables that explain outright refusal (concerns about privacy), and so on. These causes of missingness are not likely to be present in the data set, but some of them are possibly related to X and Y and thus by omission may induce relationships between X or Y and R. Other causes may be entirely unrelated to X and Y and may be viewed as external noise. If we let Z denote the component of cause that is unrelated to X and Y, then MCAR, MAR, and MNAR may be represented by the graphical relationships in Figure 2. MCAR requires that the causes of missingness be entirely contained within the unrelated part Z, MAR allows some causes to be related to X, and MNAR requires some causes to be residually related to Y after relationships between X and R are taken into account.

If we move from the univariate pattern of Figure 1a to the monotone pattern of Figure 1b, MCAR means that Yj is missing with probability unrelated to any variables in the system; MAR means that it may be related only to Y1, . . . , Yj−1; and MNAR means that it is related to Yj, . . . , Yp. If Y1, . . . , Yp are repeated measurements of an outcome variable in a longitudinal study, and missing data arise only from attrition, then MCAR requires dropout to be independent of responses at every occasion, MAR allows dropout to depend on responses at any or all occasions prior to dropout, and MNAR means that it depends on the unseen responses after the participant drops out. In this specialized setting, MAR has been called noninformative or ignorable dropout, whereas MNAR is called informative (Diggle & Kenward, 1994).

1 In Rubin’s (1976) definition, Equation 1 is not required to hold for all possible values of R, but only for the R that actually appeared in the sample. This technical point clarifies certain issues. For example, suppose that an experiment produced no missing values even though it could have. In that case, Equation 1 would hold because Ymis is empty, and Rubin’s (1976) results indicate that one should simply analyze the complete data without worrying about the fact that missing values could have arisen in hypothetical repetitions of the experiment.

With the arbitrary pattern of Figure 1c, MCAR still requires independence between missingness and Y1, . . . , Yp. However, MAR is now more difficult to grasp. MAR means that a participant’s probabilities of response may be related only to his or her own set of observed items, a set that may change from one participant to another. One could argue that this assumption is odd or unnatural, and in many cases we are inclined to agree. However, the apparent awkwardness of MAR does not imply that it is far from true. Indeed, in many situations, we believe that MAR is quite plausible, and the analytic simplifications that result from making this assumption are highly beneficial.

In discussions with researchers, we have found that most misunderstandings of MCAR, MAR, and MNAR arise from common notions about the meaning of random. To a statistician, random suggests a process that is probabilistic rather than deterministic. In that sense, MCAR, MAR, and MNAR are all random, because they all posit probability distributions for R. To a psychologist, random may suggest a process that is unpredictable and extraneous to variables in the current study (e.g., tossing a coin or rolling a die), a notion that agrees more closely with MCAR than with MAR. In retrospect, Rubin’s (1976) choice of terminology seems a bit unfortunate, but these terms are now firmly established in the statistical literature and are unlikely to change.

The Plausibility of MAR

In certain settings, MAR is known to hold. These include planned missingness in which the missing data were never intended to be collected in the first place: cohort-sequential designs for longitudinal studies (McArdle & Hamagami, 1991; Nesselroade & Baltes, 1979) and the use of multiple questionnaire forms containing different subsets of items (Graham, Hofer, & Piccinin, 1994; Graham, Hofer, & MacKinnon, 1996). Planned missingness in a study may have important advantages in terms of efficiency and cost (Graham, Taylor, & Cumsille, 2001). Planned missing values are usually MCAR, but MAR situations sometimes arise—for example, if participants are included in a follow-up measure only if their pretest scores exceed a cutoff value. Latent variables are missing with probability one and are therefore also known to be MAR.

When missingness is beyond the researcher’s control, its distribution is unknown and MAR is only an assumption. In general, there is no way to test whether MAR holds in a data set, except by obtaining follow-up data from nonrespondents (Glynn, Laird, & Rubin, 1993; Graham & Donaldson, 1993) or by imposing an unverifiable model (Little & Rubin, 1987, chapter 11). In most cases we should expect departures from MAR, but whether these departures are serious enough to cause the performance of MAR-based methods to be seriously degraded is another issue entirely (Graham, Hofer, Donaldson, MacKinnon, & Schafer, 1997). Recently, Collins, Schafer, and Kam (2001) demonstrated that in many realistic cases, an erroneous assumption of MAR (e.g., failing to take into account a cause or correlate of missingness) may often have only a minor impact on estimates and standard errors.

Example

Suppose that systolic blood pressures of N participants are recorded in January (X). Some of them have a second reading in February (Y), but others do not. Table 1 shows simulated data for N = 30 participants drawn from a bivariate normal population with means μX = μY = 125, standard deviations σX = σY = 25, and correlation ρ = .60. The first two columns of the table show the complete data for X and Y. The other columns show the values of Y that remain after imposing missingness by three methods. In the first method, the 7 measured in February were randomly selected from those measured in January; this mechanism is MCAR. In the second method, those who returned in February did so because their January measurements exceeded 140 (X > 140), a level used for diagnosing hypertension; this is MAR but not MCAR. In the third method, those recorded in February were those whose February measurements exceeded 140 (Y > 140). This could happen, for example, if all individuals returned in February, but the staff person in charge decided to record the February value only if it was in the hypertensive range. This third mechanism is an example of MNAR. (Other MNAR mechanisms are possible; e.g., the February measurement may be recorded only if it is substantially different from the January reading.) Notice that as we move from MCAR to MAR to MNAR, the observed Y values become an increasingly select and unusual group relative to the population; the sample mean increases, and the standard deviation decreases. This phenomenon is not a universal feature of MCAR, MAR, and MNAR, but it does happen in many realistic examples.

Figure 2. Graphical representations of (a) missing completely at random (MCAR), (b) missing at random (MAR), and (c) missing not at random (MNAR) in a univariate missing-data pattern. X represents variables that are completely observed, Y represents a variable that is partly missing, Z represents the component of the causes of missingness unrelated to X and Y, and R represents the missingness.
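The following short simulation is our own sketch, not the authors’ code; the seed and variable names are ours, and the MCAR step observes each Y with probability 7/30 rather than selecting exactly 7 values:

import numpy as np

rng = np.random.default_rng(2002)
n = 30
cov = 0.60 * 25 * 25     # rho * sigma_X * sigma_Y
xy = rng.multivariate_normal([125, 125], [[625, cov], [cov, 625]], size=n)
x, y = xy[:, 0], xy[:, 1]

y_mcar = np.where(rng.random(n) < 7 / n, y, np.nan)   # random subset observed
y_mar  = np.where(x > 140, y, np.nan)                 # observed only if X > 140
y_mnar = np.where(y > 140, y, np.nan)                 # observed only if Y > 140

for label, col in [("MCAR", y_mcar), ("MAR", y_mar), ("MNAR", y_mnar)]:
    obs = col[~np.isnan(col)]
    print(label, obs.size, round(obs.mean(), 1), round(obs.std(ddof=1), 1))

As in Table 1, the observed means should drift upward and the standard deviations should shrink as one moves from MCAR to MAR to MNAR.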

We frequently return to this example throughout this article to illustrate the performance of various methods. Because the rate of missing values is high, the chosen method will exert a high degree of influence over the results, and differences among competing methods will be magnified. Effects will also be large because of the unusually strong nature of the MAR and MNAR processes. In psychological studies, one would rarely expect a datum to be missing if and only if its value exceeds a sharp cutoff. More commonly, one might expect a gradual increasing or decreasing or perhaps curvilinear relationship between X or Y and the probability of missingness. Also, in most cases, the reasons or causes of missingness are not X and Y themselves but external factors that are merely related to X and Y, perhaps not strongly. Differences in results due to varying missing-data treatments in this example should therefore be regarded as more extreme than what we usually see in practice.

We would ideally like to have a single procedure for estimating all the parameters (μX, μY, σX, σY, ρ) from the observed data that performs well over repeated samples regardless of how missingness is distributed. Technically speaking, this is not possible. We see below that some data editing methods perform well for some parameters under MCAR and occasionally MAR. Using likelihood or Bayesian procedures, one may achieve good performance for all parameters without knowing the distribution of missingness, provided that it is MAR. Under MNAR data the task becomes more difficult; one must specify a model for the missingness that is at least approximately correct, and even then the performance may be poor unless the sample is very large. However, even though there is no single procedure that is technically correct in every situation, all is not lost. We do believe that good performance is often achievable through likelihood or Bayesian methods without specifically modeling the probabilities of missingness, because in many psychological research settings the departures from MAR are probably not serious.

Table 1
Simulated Blood Pressure Measurements (N = 30 Participants) in January (X) and February (Y) With Missing Values Imposed by Three Different Methods

Data for individual participants

   X    Y complete   Y MCAR   Y MAR   Y MNAR
  169      148         148      148     148
  126      123         —        —       —
  132      149         —        —       149
  160      169         —        169     169
  105      138         —        —       —
  116      102         —        —       —
  125       88         —        —       —
  112      100         —        —       —
  133      150         —        —       150
   94      113         —        —       —
  109       96         —        —       —
  109       78         —        —       —
  106      148         —        —       148
  176      137         —        137     —
  128      155         —        —       155
  131      131         —        —       —
  130      101         101      —       —
  145      155         —        155     155
  136      140         —        —       —
  146      134         —        134     —
  111      129         —        —       —
   97       85          85      —       —
  134      124         124      —       —
  153      112         —        112     —
  118      118         —        —       —
  137      122         122      —       —
  101      119         —        —       —
  103      106         106      —       —
   78       74          74      —       —
  151      113         —        113     —

Summary data: Mean (with standard deviation in parentheses)

  125.7    121.9       108.6    138.3   153.4
 (23.0)   (24.7)      (25.1)   (21.1)   (7.5)

Note. Dashes indicate missing values. MCAR = missing completely at random; MAR = missing at random; MNAR = missing not at random.

Implications of MAR, MCAR, and MNAR

With complete data Ycom, statistical methods are usually motivated by an assumption that the data are randomly sampled from a population distribution P(Ycom; θ), where θ represents unknown parameters. To a statistician, P(Ycom; θ) has two very different interpretations. First, it is regarded as the repeated-sampling distribution for Ycom; it describes the probability of obtaining any specific data set among all the possible data sets that could arise over hypothetical repetitions of the sampling procedure and data collection. Second, it is regarded as a likelihood function for θ. In the likelihood interpretation, we substitute the realized value of Ycom into P(Ycom; θ), and the resulting function for θ summarizes the data’s evidence about parameters. Certain classes of modern statistical procedures, including all ML methods, are motivated by viewing P(Ycom; θ) as a likelihood function. Other procedures, including nonparametric and semiparametric methods, are justified only from the repeated-sampling standpoint.

When portions of Ycom are missing, it is tempting to base all statistical procedures on P(Yobs; θ), the distribution of the observed part only. This distribution is obtained by calculus as the definite integral of P(Ycom; θ) with respect to Ymis,

P(Yobs; θ) = ∫ P(Ycom; θ) dYmis. (2)

Details and examples of integrating probability distributions are given in texts on probability theory (e.g., Hogg & Tanis, 1997). In missing-data problems, however, it is not automatically true that Equation 2 is the correct sampling distribution for Yobs and the correct likelihood for θ based on Yobs. Rubin (1976) first identified the conditions under which it is a proper sampling distribution and a proper likelihood; interestingly, the conditions are not identical. For Equation 2 to be a correct sampling distribution, the missing data should be MCAR. For Equation 2 to be a correct likelihood, we need only MAR. The weaker conditions for the latter suggest that missing-data procedures based on likelihood principles are generally more useful than those derived from repeated-sampling arguments only. We believe that to be true. Many of the older data-editing procedures bear no relationship to likelihood and may be valid only under MCAR. Even when MCAR does hold, these methods may be inefficient. Methods motivated by treating Equation 2 as a likelihood tend to be more powerful and better suited to real-world applications in which MCAR is often violated.
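The reason MAR suffices for likelihood inference can be sketched in one line (a standard derivation, written here in LaTeX notation). Writing P(R | Ycom; ξ) for a model of the missingness with its own parameters ξ (notation introduced formally in Equation 3 below), MAR lets the missingness factor be pulled outside the integral because it depends only on Yobs:

P(Y_{\mathrm{obs}}, R;\, \theta, \xi)
  = \int P(Y_{\mathrm{com}};\, \theta)\, P(R \mid Y_{\mathrm{com}};\, \xi)\, dY_{\mathrm{mis}}
  = P(R \mid Y_{\mathrm{obs}};\, \xi) \int P(Y_{\mathrm{com}};\, \theta)\, dY_{\mathrm{mis}}
  = P(R \mid Y_{\mathrm{obs}};\, \xi)\, P(Y_{\mathrm{obs}};\, \theta).

As a function of θ, this is proportional to Equation 2, so maximizing Equation 2 maximizes the full likelihood (provided that θ and ξ are distinct parameters).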

Finally, we note that the attractive properties of likelihood carry over to the Bayesian method of MI, because in the Bayesian paradigm we combine a likelihood function with a prior distribution for the parameters. As the sample size grows, the likelihood dominates the prior, and Bayesian and likelihood answers become similar (Gelman, Rubin, Carlin, & Stern, 1995).

Missing Values That Are Not MAR

What happens when the missing data are not MAR? It is then not appropriate to use Equation 2 either as a sampling distribution or as a likelihood. From a likelihood standpoint, the correct way to proceed is to choose an explicit model for the missingness, P(R | Ycom; ξ), where ξ denotes unknown parameters of the missingness distribution. For example, if missingness is confined to a single variable, then we may suppose that the binary indicators in R are described by a logistic regression on the variable in question, and ξ would consist of an intercept and slope. The joint model for the data and missingness becomes the product of P(Ycom; θ) and P(R | Ycom; ξ). The correct likelihood function is then given by the integral of this product over the unseen missing values,

P(Yobs, R; θ, ξ) = ∫ P(Ycom; θ) P(R | Ycom; ξ) dYmis, (3)

where d is the calculus differential. The practical implication of MNAR is that the likelihood for θ now depends on an explicit model for R. In most cases, this missingness model is a nuisance; questions of substantive interest usually pertain to the distribution of Ycom, not the distribution of R. Nevertheless, under MNAR the model for R contributes information about θ, and the evidence about θ from Equation 3 may present a very different picture from that given by Equation 2. Likelihoods for MNAR models are often difficult to handle from a computational standpoint, but some interesting work in this area has been published recently. Methods for MNAR data are reviewed at the end of this article.
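For the single-variable case mentioned above, one common concretization (our notation, not spelled out in the text) is a logistic selection model in which the probability of missingness depends, on the logit scale, on the variable Y itself:

\mathrm{logit}\, P(R = 0 \mid Y;\, \xi) \;=\; \xi_0 + \xi_1 Y,

so that ξ consists of the intercept ξ0 and slope ξ1, with R = 1 denoting an observed value as before. Setting ξ1 = 0 makes missingness unrelated to Y, whereas ξ1 ≠ 0 yields MNAR missingness.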

Missing Values That Are Out of Scope

In addition to MAR, there is another situation in which Equation 2 is an appropriate likelihood: when the fact that an observation is missing causes it to leave the universe of interest. Consider a questionnaire with an item, “How well do you get along with your siblings?” Responses for some participants are missing because they have no siblings. Literally speaking, there are no missing data in this problem at all. However, if the intended analysis could be carried out more simply if the data were balanced (i.e., if responses to this item were available for each participant), then for computational reasons it may be worthwhile to write the likelihood P(Yobs; θ) in the form of Equation 2, where Yobs denotes responses for those who have siblings and Ymis represents hypothetical responses for those who do not. In this case, the hypothetical missing data could be regarded as MAR and θ would be the parameters of the distribution of responses for the population of those with siblings. We need not worry about whether missingness depends on the characteristics of nonexistent siblings; these missing values are introduced merely as a mathematical device to simplify the computations.

To a certain extent, this discussion also applies to longitudinal studies in which some participants die. Outcome measures of physical or mental status (e.g., cognitive functioning) have meaning for live persons only. One who dies automatically leaves the universe of interest, so values that are “missing” because of death may often be regarded as MAR. However, if the measurements of these outcomes are spaced far apart in time—for example, if they are taken annually—then the death of a participant may provide indirect evidence of an unmeasured steep decline in the outcome prior to death, and response trajectories estimated under an MAR assumption may be somewhat optimistic. In those cases, joint modeling of the outcome and death events may be warranted (Hogan & Laird, 1997).

Older Methods

Case Deletion

Among older methods for missing data, the most popular is to discard units whose information is incomplete. Case deletion, also known commonly as listwise deletion (LD) and complete-case analysis, is used by default in many statistical programs, but details of its implementation vary. LD confines attention to units that have observed values for all variables under consideration. For example, suppose we are computing a sample covariance matrix for items X1, . . . , Xp. LD omits from consideration any case that has a missing value on any of the variables X1, . . . , Xp.

Available-case (AC) analysis, in contrast to LD, uses different sets of sample units for different parameters. For estimating covariances, this is sometimes called pairwise deletion or pairwise inclusion. For example, we may use every observed value of Xj to estimate the standard deviation of Xj, and every observed pair of values (Xj, Xk) to estimate the covariance of Xj and Xk. For the correlation between Xj and Xk, we might compute the sample correlation coefficient using the same set of units that we used to estimate the covariance. On the other hand, we could also divide our estimated covariance by the estimated standard deviations. The latter seems more efficient, but it could conceivably yield a correlation outside of the interval [−1, 1], causing one or more eigenvalues to be negative. We believe that the underlying principle of AC analysis—to make use of all the available data—is eminently sensible, but deleting cases is a poor way to operationalize it. Another limitation of AC analysis is that, because parameters are estimated from different sets of units, it is difficult to compute standard errors or other measures of uncertainty; analytic methods are troublesome, and other procedures (e.g., bootstrapping) are at least as tedious as they would be for other estimates with better properties.
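A brief sketch of AC estimation (our own code, not the authors’; it assumes pandas, whose cov and corr methods use pairwise-complete observations by default):

import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "x2": [2.0, np.nan, 3.0, 5.0, 4.0],
                   "x3": [1.0, 1.5, 2.5, np.nan, 3.0]})

pairwise_cov = df.cov()    # each entry uses every pair observed for that entry
pairwise_corr = df.corr()  # pairwise-deletion correlations

# The assembled matrix can fail to be positive semidefinite, as noted above;
# a negative eigenvalue signals an inconsistent set of pairwise estimates.
print(np.linalg.eigvalsh(pairwise_corr.to_numpy()))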

Properties of case deletion. Case deletion can be motivated by viewing Equation 2 as a sampling distribution for Yobs and is generally valid only under MCAR. In a few circumstances, it produces inferences that are optimal under MAR. For example, under the univariate missingness pattern of Figure 1a, the parameters of the regression of Y on any subset of X1, . . . , Xp can be estimated from the complete cases and the estimates are both valid and efficient under MAR (e.g., see Graham & Donaldson, 1993). However, this result does not extend to other measures of association between Y and X such as correlation coefficients, nor does it extend to parameters of the marginal distribution of Y. When the missing data are not MCAR, results from case deletion may be biased, because the complete cases can be unrepresentative of the full population. If the departures from MCAR are not serious, then the impact of this bias might be unimportant, but in practice it can be difficult to judge how large the biases might be.

When MCAR holds, case deletion can still be inefficient. Consider again the univariate pattern in Figure 1a, and suppose that we want to estimate aspects of the marginal distribution of Y (e.g., the population mean). Furthermore, suppose that Y and the Xs are highly related, so that the missing values of Y can be predicted from X with near certainty. Case deletion bases estimates on the reduced sample of Y values, ignoring the strong predictive information contained in the Xs. Researchers become acutely aware of the inefficiency of case deletion in multivariate analyses involving many items, in which mild rates of missing values on each item may cause large portions of the sample to be discarded.

The main virtue of case deletion is simplicity. If a missing-data problem can be resolved by discarding only a small part of the sample, then the method can be quite effective. However, even in that situation, one should explore the data to make sure that the discarded cases are not unduly influential. In a study of a rare disease or condition, for example, one should verify that the small group being discarded does not contain a large proportion of the participants possessing that condition. For more discussion on the properties of case deletion and further references, see chapter 3 of Little and Rubin (1987).

Example. For the systolic blood pressure data shown in Table 1, LD removes all participants whose Y values are missing. To demonstrate the properties of LD over repeated samples, we performed a simulation experiment. One thousand samples were drawn from the bivariate normal population, and missing values were imposed on each sample by each of the three mechanisms. For MAR and MNAR, Y was made missing if X ≤ 140 and Y ≤ 140, respectively, producing an average rate of 73% missing observations for Y. For MCAR, each Y value was made missing with probability .73. The total number of participants in each sample was increased to 50 to ensure that, after case deletion, a sufficient number remained to support the estimation of correlation and regression coefficients. After deletion, standard techniques were used to calculate estimates and 95% intervals for five population parameters: the mean of Y (μY = 125), the standard deviation of Y (σY = 25), the correlation coefficient (ρXY = .60), the slope for the regression of Y on X (βY|X = .60), and the slope for the regression of X on Y (βX|Y = .60). For σY, the confidence interval consisted of the square roots of the endpoints of the classical interval for the variance of a normal population. For ρ, the interval was calculated by applying Fisher’s transformation z = tanh⁻¹(r) to the sample correlation r, adding and subtracting 1.96(N − 3)^(−1/2), and applying the inverse transformation r = tanh(z) to the endpoints.
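The Fisher z interval just described is easy to reproduce; a minimal sketch in our own code, assuming numpy:

import numpy as np

def fisher_ci(r, n, z_crit=1.96):
    # Approximate 95% confidence interval for a correlation coefficient.
    z = np.arctanh(r)                 # Fisher's transformation z = tanh^-1(r)
    half = z_crit / np.sqrt(n - 3)    # standard error of z is (N - 3)^(-1/2)
    return np.tanh(z - half), np.tanh(z + half)

print(fisher_ci(0.60, 50))    # e.g., r = .60 in a sample of N = 50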

Results of the simulation are summarized in Table 2. The top panel of the table reports the average values of the parameter estimates under each mechanism. A discrepancy between this average and the true parameter indicates bias. A rule of thumb that we have found useful is that bias becomes problematic if its absolute size is greater than about one half of the estimate’s standard error; above this threshold, bias begins to noticeably degrade the coverage of 95% confidence intervals. Biases that are practically important according to this rule are flagged in the table. Some may disagree with this rule, because the standard error of the estimate depends on the sample size (N = 50), which was chosen arbitrarily; those readers may compare the average of the parameter estimate to the true value of the parameter and judge for themselves whether the bias seems practically important. The top panel of Table 2 also displays the root-mean-square error (RMSE), which measures the typical distance between the estimate and the true value. Small values of RMSE are desirable. Examining these results, we find that LD is unbiased under MCAR. Under MAR and MNAR, the complete cases are unrepresentative of the population, and biases are substantial, with two exceptions: The estimate of βY|X is unbiased under MAR, and the estimate of βX|Y is unbiased under MNAR.

Table 2
Performance of Listwise Deletion for Parameter Estimates and Confidence Intervals Over 1,000 Samples (N = 50 Participants)

Average parameter estimate (with RMSE in parentheses)

  Parameter      MCAR            MAR              MNAR
  μY = 125.0     125.0 (6.95)    143.3* (19.3)    155.5* (30.7)
  σY = 25.0      24.6 (5.26)     20.9* (5.84)     12.2* (13.2)
  ρ = .60        .59 (.19)       .33* (.37)       .34* (.36)
  βY|X = .60     .61 (.27)       .60 (.51)        .21* (.43)
  βX|Y = .60     .60 (.25)       .20* (.44)       .60 (.52)

Coverage (with average interval width in parentheses)

  Parameter      MCAR            MAR              MNAR
  μY             94.3 (30.0)     18.8* (25.0)     0.0* (14.7)
  σY             94.3 (23.3)     90.7 (19.4)      17.4* (11.4)
  ρ              95.4 (0.76)     82.5* (0.93)     82.7* (0.94)
  βY|X           94.6 (1.10)     95.9 (2.20)      40.0* (0.73)
  βX|Y           95.3 (1.08)     37.7* (0.71)     96.6 (2.23)

Note. Parameters: μ is the population mean; σ is the population standard deviation; ρ is the population correlation; β is the population regression slope. Coverage represents the percentage of confidence intervals that include the parameter value; values near 95 represent adequate coverage. Entries marked with an asterisk (boldface in the original) indicate problematic levels of bias in the top panel (i.e., bias whose absolute size is greater than about one half of the estimate’s standard error) and seriously low levels of coverage in the bottom panel (i.e., coverage that falls below 90%, which corresponds to a doubling of the nominal rate of error). MCAR = missing completely at random; MAR = missing at random; MNAR = missing not at random; RMSE = root-mean-square error.

The bottom panel of Table 2 summarizes the performance of confidence intervals. For each parameter and each mechanism, we report the coverage—the percentage of intervals that covered the true parameter—and the average interval width. Narrow intervals are desirable provided that their coverage is near 95%. As a rough rule of thumb, we consider the coverage to be seriously low if it falls below 90%, which corresponds to a doubling of the nominal rate of error; these values are flagged in the table. Examining the bottom panel of Table 2, we see that coverage is acceptable for all parameters under MCAR but low for most parameters under MAR and MNAR. Undercoverage has two possible sources: bias, which causes the interval on average to be centered to the left or to the right of the target, and underestimation of the estimate’s true variability, which causes the interval to be narrower than it should be. Under MAR and MNAR, both phenomena are occurring; the observed Y values tend to be higher and less variable than those of the full population, which biases both the parameter estimates and their standard errors.

This simulation clearly illustrates that case deletion may produce bias under non-MCAR conditions. Some might argue that these results are unrealistic because a missingness rate of 73% for Y is too high. However, it is not difficult to find published analyses in which 73% or more of the sample cases were omitted because of incomplete information. On the basis of a survey of articles in major political science journals, King, Honaker, Joseph, and Scheve (2001) reported alarmingly high rates of case deletion with serious implications for parameter bias and inefficiency. Other simulations revealing the shortcomings of case deletion have been reported by Brown (1994), Graham et al. (1996), and Wothke (2000).

Reweighting

In some non-MCAR situations, it is possible to reduce biases from case deletion by the judicious application of weights. After incomplete cases are removed, the remaining complete cases are weighted so that their distribution more closely resembles that of the full sample or population with respect to auxiliary variables. Weights are derived from the probabilities of response, which must be estimated from the data (e.g., by a logistic or probit regression). Weighting can eliminate bias due to differential response related to the variables used to model the response probabilities, but it cannot correct for biases related to variables that are unused or unmeasured. For a review of weighting in the context of sample surveys, see Little and Rubin (1987, section 4.4).

Weighting is nonparametric, requiring no model for the distribution of the data values in the population. It does, however, require some model for the probabilities of response. Weights are easy to apply for univariate and monotone missing-data patterns. For the arbitrary pattern of missing values shown in Figure 1c, weighting becomes unattractive because one must potentially compute a different set of weights for each variable. Recent years have seen a resurgence of interest in weighting, with new methods for parametric and semiparametric regression appearing in biostatistics; some of these newer methods are reviewed near the end of this article.
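A minimal sketch of inverse-probability weighting for the univariate pattern of Figure 1a (our own code, not the authors’; it assumes numpy and scikit-learn, and the population values are chosen only for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(125, 25, n)
y = 0.6 * x + rng.normal(0, 20, n)            # E(Y) = 0.6 * 125 = 75
# Response probability depends on the observed X only, so this is MAR.
p_true = 1 / (1 + np.exp(-(x - 125) / 10))
r = rng.random(n) < p_true                    # r = True means Y is observed

# Estimate response probabilities from the data by logistic regression on X.
p_hat = LogisticRegression().fit(x.reshape(-1, 1), r).predict_proba(
    x.reshape(-1, 1))[:, 1]

w = 1.0 / p_hat[r]                            # inverse-probability weights
print(y[r].mean())                            # complete-case mean: biased upward
print(np.average(y[r], weights=w))            # weighted mean: close to E(Y)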

Averaging the Available Items

Many characteristics of interest to psychologists—for example, self-esteem, depression, anxiety, quality of life—cannot be reliably measured by a single item, so researchers may create a scale by averaging the responses to multiple items. An average can be motivated by the idea that the items are exchangeable, equally reliable measures of a unidimensional trait. The items are typically standardized to have a mean of zero and a standard deviation of one before averaging. If a participant has missing values for one or more items, it seems more reasonable to average the items that remain rather than report a missing value for the entire scale. This practice is widespread, but its properties remain largely unstudied; it does not even have a well-recognized name. Undoubtedly, some researchers may have used this method without realizing that it is a missing-data technique, choosing instead to regard it as part of the scale definition. Some might call it case-by-case item deletion. A colleague has suggested the term ipsative mean imputation, because it is equivalent to substituting the mean of a participant’s own observed items for each of his or her missing items (H. B. Bosworth, personal communication, February 2001).

Averaging the available items is difficult to justify theoretically either from a sampling or likelihood perspective. Unlike case deletion, it may introduce bias under MCAR. For example, suppose that a scale is defined as the average of six items, and we compute an average for each participant who responds to at least three. With missing data the variance of the scale tends to increase, because it becomes a mixture of the averages of three, four, five, or six items rather than the average of all six. The scale also becomes less reliable, because reliability decreases as the number of items drops. The method also raises fundamental conceptual difficulties. The scale has been redefined from the average of a given set of items to the average of the available items, a definition that now depends on the particular rates and patterns of nonresponse in the current sample and that also varies from one participant to another. This violates the principle that an estimand be a well-defined aspect of a population, not an artifact of a specific data set.

Despite these theoretical problems, preliminary investigations suggest that the method can be reasonably well behaved. As an illustration, suppose that a scale is defined as the average of standardized items Y1, . . . , Y4, but these items are not equally correlated with each other or with other items. In particular, suppose that the covariance matrix for Y1, . . . , Y4 is

$$\Sigma = \begin{pmatrix} 1.0 & 0.5 & 0.2 & 0.2 \\ 0.5 & 1.0 & 0.2 & 0.2 \\ 0.2 & 0.2 & 1.0 & 0.5 \\ 0.2 & 0.2 & 0.5 & 1.0 \end{pmatrix},$$

and suppose that the correlation between these and another standardized item X is .50 for Y1 and Y2 and .10 for Y3 and Y4. Under these circumstances, the true slope for the regression of $\bar{Y} = (Y_1 + Y_2 + Y_3 + Y_4)/4$ on X is β = .30. Now suppose that Y1 and Y2 are missing fairly often, whereas Y3 and Y4 are nearly always observed. We imposed missing values in an MCAR fashion at a rate of 30% for Y1 and Y2 and 5% for Y3 and Y4. We then averaged the items provided that at least two of the four were available (participants with fewer than two were deleted) and regressed the scale on X. Over 1,000 samples of N = 100 cases each, the average value of the estimated slope was .26, and 903 of the nominal 95% confidence intervals covered the population value β = .30. The bias is noticeable but not dramatic. Modifying this example, we found that bias tends to decrease as the Yjs become more equally correlated with each other and with X and as the intercorrelations among the Yjs increase even if they are unequal.
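A rough sketch of this simulation follows; the population values are those just stated, and the layout of the joint covariance matrix of (X, Y1, . . . , Y4) is our own reconstruction from them.

```python
# Items Y1..Y4 and X jointly standard normal; Y1, Y2 are MCAR-missing 30%
# of the time, Y3, Y4 5%; the scale is the mean of the available items
# whenever at least two of the four are observed.
import numpy as np

rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.5, 0.5, 0.1, 0.1],    # order: X, Y1, Y2, Y3, Y4
                [0.5, 1.0, 0.5, 0.2, 0.2],
                [0.5, 0.5, 1.0, 0.2, 0.2],
                [0.1, 0.2, 0.2, 1.0, 0.5],
                [0.1, 0.2, 0.2, 0.5, 1.0]])
rates = np.array([0.30, 0.30, 0.05, 0.05])

slopes = []
for _ in range(1000):
    data = rng.multivariate_normal(np.zeros(5), cov, size=100)
    x, items = data[:, 0], data[:, 1:]
    items[rng.uniform(size=items.shape) < rates] = np.nan   # impose MCAR
    keep = np.sum(~np.isnan(items), axis=1) >= 2            # need 2+ items
    scale = np.nanmean(items[keep], axis=1)
    xk = x[keep]
    slopes.append(np.cov(xk, scale)[0, 1] / np.var(xk, ddof=1))

print(np.mean(slopes))   # noticeably below the true slope of .30
```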

Newer methods of MI provide a more principled solution to this problem. With MI, one imputes missing items prior to forming scales. Using modern software, it is now routinely possible to impute 100 or so items simultaneously and preserve the intercorrelations between the items, provided that enough sample cases are available to estimate the joint covariance structure. If MI is not feasible, then averaging the available items may be a reasonable choice, especially if the reliability is high (say, α > .70) and each group of items to be averaged seems to form a single, well-defined domain.

Single Imputation

When a unit provides partial information, it is tempting to replace the missing items with plausible values and proceed with the desired analysis rather than discard the unit entirely. Imputation, the practice of filling in missing items, has several desirable features. It is potentially more efficient than case deletion, because no units are sacrificed; retaining the full sample helps to prevent loss of power resulting from a diminished sample size. Moreover, if the observed data contain useful information for predicting the missing values, an imputation procedure can make use of this information and maintain high precision. Imputation also produces an apparently complete data set that may be analyzed by standard methods and software. To a data user, the practical value of being able to apply a favorite technique or software product can be immense. Finally, when data are to be analyzed by multiple persons or entities, imputing once, prior to all analyses, helps to ensure that the same set of units is being considered by each entity, facilitating the comparison of results. On the negative side, imputation can be difficult to implement well, particularly in multivariate settings. Some ad hoc imputation methods can distort data distributions and relationships. The shortcomings of single imputation have been documented by Little and Rubin (1987) and others. Here we briefly classify and review some popular single-imputation methods.

Imputing unconditional means. Consider the popular practice of mean substitution, in which missing values are replaced by the average of the observed values for that item. The average of the variable is preserved, but other aspects of its distribution—variance, quantiles, and so forth—are altered with potentially serious ramifications. Consider a large-sample 95% confidence interval for the population mean,

$$\bar{y} \pm 1.96 \sqrt{S^2 / N},$$

where $\bar{y}$ and $S^2$ are the sample mean and variance and N is the sample size. Mean substitution narrows this interval in two ways: by introducing a downward bias into $S^2$ and by overstating N. Under MCAR, the coverage probability after mean substitution is approximately $2\Phi(1.96r) - 1$, where Φ is the standard normal cumulative distribution function and r is the response rate. With 25% missing values (r = .75) the coverage drops to 86%, and the error rate is nearly three times as high as it should be. In addition to reducing variances, the method also distorts covariances and intercorrelations between variables.
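The coverage figure is easy to verify directly from the formula just given:

```python
# Coverage of the nominal 95% interval after mean substitution under MCAR,
# with a response rate of r = .75.
from scipy.stats import norm

r = 0.75
coverage = 2 * norm.cdf(1.96 * r) - 1
print(coverage)   # approximately 0.86
```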

Imputing from unconditional distributions. The idea underlying mean substitution—to predict the missing data values—is somewhat misguided; it is generally more desirable to preserve a variable’s distribution. Survey methodologists, who have long been aware of this, have developed a wide array of single-imputation methods that more effectively preserve distributional shape (Madow, Nisselson, & Olkin, 1983). One popular class of procedures known as hot deck imputation fills in nonrespondents’ data with values from actual respondents. In a simple univariate hot deck, we replace each missing value by a random draw from the observed values. Hot-deck imputation has no parametric model. It partially solves the problem of understating uncertainty, because the variability of the item is not distorted. However, without further refinements, the method still distorts correlations and other measures of association.
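A simple univariate hot deck, as just described, reduces to a few lines; the data here are illustrative.

```python
# Minimal univariate hot deck: each missing value is replaced by a random
# draw from the observed values of the same variable.
import numpy as np

rng = np.random.default_rng(2)
y = np.array([4.1, np.nan, 5.0, 3.2, np.nan, 4.7])
observed = y[~np.isnan(y)]
y_imputed = y.copy()
y_imputed[np.isnan(y)] = rng.choice(observed, size=np.isnan(y).sum())
print(y_imputed)
```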

Imputing conditional means. In the univariate situation of Figure 1a, a regression model for predicting Y from $X = (X_1, \ldots, X_p)$ may provide a basis for imputation. The model is first fit to the cases for which Y is known. Then, plugging values of X for the nonrespondents into the regression equation, we obtain predictions $\hat{Y}$ for the missing values of Y. Replacing the missing values of Y with $\hat{Y}$ is called conditional mean imputation, because $\hat{Y}$ estimates the conditional mean of Y given X. Conditional mean imputation is nearly optimal for a limited class of estimation problems if special corrections are made to standard errors (Schafer & Schenker, 2000). The method is not recommended for analyses of covariances or correlations, because it overstates the strength of the relationship between Y and the X variables; the multiple regression $R^2$ among the imputed values is 1.00. If there is no association between Y and the covariates X, then the method reduces to ordinary mean substitution.

Imputing from a conditional distribution. Distortion of covariances can be eliminated if each missing value of Y is replaced not by a regression prediction but by a random draw from the conditional or predictive distribution of Y given X. With a standard linear model, we may add to $\hat{Y}$ a residual error drawn from a normal distribution with mean zero and variance estimated by the residual mean square. In a logistic regression for a dichotomous Y, one may calculate the fitted probability $\hat{p}$ for each case, draw a random uniform variate u, and set Y = 1 if $u \le \hat{p}$ and Y = 0 if $u > \hat{p}$.
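The following sketch (with simulated data and illustrative names) contrasts the two linear-model variants just described: conditional mean imputation versus draws from the estimated predictive distribution.

```python
# Univariate missing-data pattern (Figure 1a): X fully observed, Y
# sometimes missing; regression fit on the complete cases.
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 2.0 + 0.6 * x + rng.normal(scale=0.8, size=n)
y[rng.uniform(size=n) < 0.3] = np.nan                 # impose missing Y

obs = ~np.isnan(y)
X_obs = np.column_stack([np.ones(obs.sum()), x[obs]])
beta, *_ = np.linalg.lstsq(X_obs, y[obs], rcond=None)  # fit Y on X
resid_var = np.mean((y[obs] - X_obs @ beta) ** 2)      # residual mean square

pred = beta[0] + beta[1] * x[~obs]
y_cm = y.copy(); y_cm[~obs] = pred                     # conditional mean
y_pd = y.copy()
y_pd[~obs] = pred + rng.normal(scale=np.sqrt(resid_var), size=(~obs).sum())

# Conditional means understate residual variation; predictive draws restore it.
print(np.var(y_cm[~obs]), np.var(y_pd[~obs]))
```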

More generally, suppose that we have data $Y_{com} = (Y_{obs}, Y_{mis})$ from distribution $P(Y_{obs}, Y_{mis}; \theta)$. Imputing from the conditional distribution means simulating a draw from

$$P(Y_{mis} \mid Y_{obs}; \theta) = \frac{P(Y_{obs}, Y_{mis}; \theta)}{P(Y_{obs}; \theta)}, \qquad (4)$$

where the denominator is given by Equation 2. In practice, the parameters are unknown and must be estimated, in which case we would draw from $P(Y_{mis} \mid Y_{obs}; \hat{\theta})$, where $\hat{\theta}$ is an estimate of θ obtained from $Y_{obs}$. Imputing from Equation 4 assumes MAR. The method produces nearly unbiased estimates for many population quantities under MAR if the model (Equation 4) is correctly specified. Formulating the conditional distribution and drawing from it tends to be easiest for univariate missing-data patterns. With a monotone pattern, the conditional distribution can be expressed as a sequence of regressions for $Y_j$ given $Y_1, \ldots, Y_{j-1}$ for $j = 1, \ldots, p$, which is not too difficult. With arbitrary patterns, the conditional distribution can be quite complicated, and drawing from it may require nearly as much effort as full MI, which has superior properties.

Example. Consider the problem of imputing the missing blood pressure readings in Table 1. For the MAR condition, we imputed the missing values of Y by four methods: (a) performing mean substitution, (b) using a simple hot deck, (c) performing conditional mean imputation based on the linear regression of Y on X, and (d) drawing from the estimated predictive distribution of Y given X based on the same regression. Bivariate scatter plots of Y versus X for the four imputation methods are displayed in Figure 3. These plots clearly reveal the shortcomings of the first three methods. Because the variables are jointly normal with correlation .60, a plot of complete data should resemble an elliptical cloud with a moderate positive slope. Mean substitution (see Figure 3a) causes all the imputed values of Y to fall on a horizontal line, whereas conditional mean imputation (see Figure 3c) causes them to fall on a regression line. The hot deck (see Figure 3b) produces an elliptical cloud with too little correlation. The only method that produces a reasonable point cloud is imputation from the conditional distribution of Y given X (see Figure 3d).

To illustrate the operating characteristics of these methods, we performed a simulation similar to the one we did for case deletion. One thousand samples of N = 50 were drawn from the bivariate normal population, and missing values were imposed on each sample by the MCAR, MAR, and MNAR methods. Each incomplete data set was then imputed by the four methods shown in Figure 3. The results are summarized in Table 3. As before, the top panel of the table reports the average of the parameter estimates and the RMSE for each mechanism–imputation method combination; estimates with substantial bias are displayed in boldface. Coverage and average width of the nominal 95% confidence intervals are shown in the bottom panel, with seriously low coverages in boldface. On the basis of these results in the top panel, we make the following general observations. Mean substitution and the hot deck produce biased estimates for many parameters under any type of missingness. Conditional mean imputation performs slightly better but still may introduce bias. Imputing from a conditional distribution is essentially unbiased under MCAR or MAR but potentially biased under MNAR.

Figure 3. Example scatter plots of blood pressure measured at two occasions, with missing at random missing values imputed by four methods.

The problem of undercoverage. Examining the bottom panel of Table 3, we find that the performance of estimated intervals is a disaster. In nearly all cases, the actual coverage is much lower than 95%. (The only exceptions to this rule occur for $\beta_{X|Y}$ under mean substitution, which causes the intervals to be exceptionally wide.) This shortcoming of single imputation is well documented (Rubin, 1987). Even if it preserves marginal and joint distributions, it still tends to understate levels of uncertainty, because conventional uncertainty measures ignore the fact that the imputed values are only guesses. With single imputation, there is no simple way to reflect missing-data uncertainty. In this example, the implications are very serious because the amount of missing information is unusually high, but even in less extreme situations undercoverage is still an issue. The problem of undercoverage is solved by MI, to be described later.

When single imputation is reasonable. Despite the discouraging news in the bottom panel of Table 3, there are situations in which single imputation is reasonable and better than case deletion. Imagine a data set with 25 variables in which 3% of all the data values are missing. If missing values are spread uniformly across the data matrix, then LD discards more than half of the participants ($1 - .97^{25} = .53$). On the other hand, imputing once from a conditional distribution permits the use of all participants with only a minor negative impact on estimates and uncertainty measures.

Table 3
Performance of Single-Imputation Methods for Parameter Estimates and Confidence Intervals Over 1,000 Samples (N = 50 Participants)

Average parameter estimate (with RMSE in parentheses)

                        MCAR                                MAR                                 MNAR
Parameter     MS      HD      CM      PD         MS      HD      CM      PD         MS      HD      CM      PD
μY = 125.0  125.1   125.2   125.2   125.1      143.5   143.5   124.9   124.8      155.5   155.5   151.6   151.6
            (7.18)  (7.89)  (6.26)  (6.57)     (19.4)  (19.5)  (18.1)  (18.3)     (30.7)  (30.73) (26.9)  (26.9)
σY = 25.0    12.3    23.4    18.2    24.7       10.6    20.0    20.4    27.0        6.20    11.7    8.42    12.9
            (13.0)  (5.40)  (8.57)  (5.37)     (14.6)  (6.68)  (10.7)  (8.77)     (18.9)  (13.7)  (16.9)  (12.7)
ρ = .60       .30     .16     .79     .59        .08     .04     .64     .50        .15     .08     .55     .38
            (.32)   (.46)   (.27)   (.20)      (.52)   (.57)   (.48)   (.40)      (.47)   (.53)   (.40)   (.37)
βY|X = .60    .16     .16     .61     .60        .04     .04     .61     .62        .04     .04     .21     .21
            (.45)   (.47)   (.25)   (.27)      (.56)   (.57)   (.57)   (.57)      (.56)   (.56)   (.43)   (.43)
βX|Y = .60    .61     .17    1.12     .60        .20     .06     .78     .45        .61     .19    1.63     .76
            (.26)   (.46)   (.64)   (.24)      (.44)   (.56)   (.75)   (.40)      (.55)   (.53)   (1.72)  (.68)

Coverage (with average interval width in parentheses)

                        MCAR                                MAR                                 MNAR
Parameter     MS      HD      CM      PD         MS      HD      CM      PD         MS      HD      CM      PD
μY           39.2    60.0    58.5    71.0        0.2     2.4    25.7    32.3        0.0     0.0     0.0     0.0
            (7.0)   (13.3)  (10.4)  (14.1)     (6.0)   (11.4)  (11.6)  (15.3)     (3.5)   (6.7)   (4.8)   (7.3)
σY            0.7    63.7    31.3    65.4        0.1    45.3    30.0    49.4        0.0     1.7     0.7     4.4
            (5.1)   (9.6)   (7.5)   (10.2)     (4.4)   (8.2)   (8.4)   (11.1)     (2.5)   (4.8)   (3.5)   (5.3)
ρ            25.5     5.5    21.7    65.0        0.0     0.0    19.6    40.7        2.2     0.5    37.6    50.0
            (0.50)  (0.53)  (0.19)  (0.35)     (0.55)  (0.55)  (0.21)  (0.34)     (0.54)  (0.54)  (0.31)  (0.43)
βY|X          1.2    16.5    38.6    63.5        0.0     0.8    17.2    33.5        0.0     0.1     3.1     7.4
            (0.27)  (0.54)  (0.22)  (0.44)     (0.25)  (0.47)  (0.22)  (0.45)     (0.14)  (0.27)  (0.13)  (0.26)
βX|Y         98.1    23.9     8.9    71.1       91.2    14.5    18.6    60.0       97.4    71.3    19.1    56.2
            (1.18)  (0.63)  (0.50)  (0.47)     (1.43)  (0.75)  (0.56)  (0.46)     (2.50)  (1.30)  (1.46)  (1.05)

Note. Parameters: μ is the population mean; σ is the population standard deviation; ρ is the population correlation; β is the population regression slope. Coverage represents the percentage of confidence intervals that include the parameter value; values near 95 represent adequate coverage. Boldface in the top panel indicates problematic levels of bias (i.e., bias whose absolute size is greater than about one half of the estimate’s standard error); boldface in the bottom panel indicates seriously low levels of coverage (i.e., coverage that falls below 90%, which corresponds to a doubling of the nominal rate of error). MCAR = missing completely at random; MAR = missing at random; MNAR = missing not at random; MS = mean substitution; HD = hot deck; CM = conditional mean imputation; PD = predictive distribution imputation; RMSE = root-mean-square error.

ML Estimation

The principle of drawing inferences from a likelihood function is widely accepted. Under MAR, the marginal distribution of the observed data (Equation 2) provides the correct likelihood for the unknown parameters θ, provided that the model for the complete data is realistic. Little and Rubin (1987, p. 89) referred to this function as “the likelihood ignoring the missing-data mechanism,” but for brevity we simply call it the observed-data likelihood. The logarithm of this function,

$$\ell(\theta; Y_{obs}) = \log L(\theta; Y_{obs}), \qquad (5)$$

plays a crucial role in estimation. The ML estimate $\hat{\theta}$, the value of θ for which Equation 5 is highest, has attractive theoretical properties just as it does in complete-data problems. Under rather general regularity conditions, it tends to be approximately unbiased in large samples. It is also highly efficient; as the sample size grows, its variance approaches the theoretical lower bound of what is achievable by any unbiased estimator (e.g., Cox & Hinkley, 1974).

Confidence intervals and regions are often computed by appealing to the fact that, in regular problems with large samples, $\hat{\theta}$ is approximately normally distributed about the true parameter θ with approximate covariance matrix

$$V(\hat{\theta}) \approx [-\ell''(\hat{\theta})]^{-1}, \qquad (6)$$

where $\ell''(\hat{\theta})$ is the matrix of second partial derivatives of Equation 5 with respect to the elements of θ. The matrix $-\ell''(\hat{\theta})$, which is often called observed information, describes how quickly the log-likelihood function drops as we move away from the ML estimate; a steep decline indicates that the ML estimate is apparently precise, whereas a gradual decline implies there is considerable uncertainty about where the true parameter lies. This matrix is sometimes replaced by its expected value, which is called expected information or Fisher information, because the expected value is sometimes easier to compute. In complete-data problems, the approximation (Equation 6) is still valid when the observed information is replaced by the expected information. However, as recently pointed out by Kenward and Molenberghs (1998), this is not necessarily true with missing data. Expected information implicitly uses Equation 5 as a sampling distribution for $Y_{obs}$, which is valid only if the missing data are MCAR. In missing-data problems, therefore, if we want to obtain standard errors and confidence intervals that are valid under the general MAR condition, we should base them on an observed rather than an expected information matrix. If a software package that performs ML estimation with missing data offers the choice of computing standard errors from the observed or the expected information, the user should opt for the former.

Log likelihood also provides a method for testing hypotheses about elements or functions of θ. Suppose that we wish to test the null hypothesis that θ lies in a certain area or region of the parameter space versus the alternative that it does not. Under suitable regularity conditions, this test may be performed by comparing a difference in log likelihoods to a chi-square distribution. More specifically, let $\hat{\theta}$ denote the maximizer of the log likelihood over the full parameter space, and let $\tilde{\theta}$ be the maximizer over the region defined by the null hypothesis. We would reject the null at the designated alpha level if $2[\ell(\hat{\theta}; Y_{obs}) - \ell(\tilde{\theta}; Y_{obs})]$ exceeds the $100(1 - \alpha)$ percentile of the chi-square distribution. The degrees of freedom are given by the difference in the number of free parameters under the null and alternative hypotheses—that is, the number of restrictions that must be placed on the elements of θ to ensure that it lies in the null region. These likelihood-ratio tests are quite attractive for missing-data problems, because they only require that we be able to compute the maximizers $\hat{\theta}$ and $\tilde{\theta}$; no second derivatives are needed.
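In code, the test is a one-line comparison once the two maximized log likelihoods are in hand; the log-likelihood values and degrees of freedom below are placeholders, not results from any model discussed here.

```python
# Likelihood-ratio test: compare twice the log-likelihood difference to a
# chi-square distribution with df = number of restrictions under the null.
from scipy.stats import chi2

loglik_full = -512.4     # l(theta_hat; Yobs), maximized over the full space
loglik_null = -516.9     # l(theta_tilde; Yobs), maximized under the null
df = 2                   # restrictions imposed by the null hypothesis

lr = 2 * (loglik_full - loglik_null)
print("LR statistic:", lr, "p value:", chi2.sf(lr, df))
```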

Computing ML Estimates

In a few problems, the maximizer of the log likelihood (Equation 5) can be computed directly. One famous example is the bivariate normal monotone problem presented by Anderson (1957). Suppose that a portion of sample units (Part A) has values recorded for variables X and Y, and the remainder of sample units (Part B) has values only for X. The basic idea is as follows: Use the full sample (A + B) to estimate the mean and variance of X, use the reduced sample (A only) to estimate the parameters of the linear regression of Y on X, and then combine the two sets of estimates to obtain the parameters for the full joint distribution. For details on this procedure, see Little and Rubin (1987, section 6.2).
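A minimal sketch of this factored-likelihood computation follows; the data are simulated purely for illustration, with population values loosely echoing the blood pressure example.

```python
# Anderson's (1957) ML estimates for the bivariate normal monotone pattern:
# X observed for everyone, Y observed only for Part A.
import numpy as np

rng = np.random.default_rng(4)
n, n_a = 200, 80
x = rng.normal(125, 25, size=n)
y = 125 + 0.6 * (x - 125) + rng.normal(0, 20, size=n)
y[n_a:] = np.nan                                   # Part B: Y missing

# 1. Mean and variance of X from the full sample (A + B).
mu_x, var_x = x.mean(), x.var()                    # ML uses ddof=0

# 2. Regression of Y on X from the complete cases (A only).
xa, ya = x[:n_a], y[:n_a]
b1 = np.cov(xa, ya, ddof=0)[0, 1] / xa.var()
b0 = ya.mean() - b1 * xa.mean()
resid_var = np.mean((ya - b0 - b1 * xa) ** 2)

# 3. Combine into ML estimates of the joint distribution.
mu_y = b0 + b1 * mu_x
var_y = resid_var + b1 ** 2 * var_x
cov_xy = b1 * var_x
print(mu_y, np.sqrt(var_y), cov_xy / np.sqrt(var_x * var_y))
```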

Except for these special cases, expressions for ML estimates cannot in general be written down in closed form, and computing them requires iteration. A general method for ML in missing-data problems was described by Dempster et al. (1977) in their influential article on the EM algorithm. The key idea of EM is to solve a difficult incomplete-data estimation problem by iteratively solving an easier complete-data problem. Intuitively, we “fill in the missing data” with a best guess at what it might be under the current estimate of the unknown parameters, then reestimate the parameters from the observed and filled-in data. To obtain the correct answer, we must clarify what it means to “fill in the missing data.” Dempster et al. showed that, rather than filling in the missing data values per se, we must fill in the complete-data sufficient statistics. The form of these statistics depends on the model under consideration. Overviews of EM have been given by Little and Rubin (1987), Schafer (1997), and McLachlan and Krishnan (1997).
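To illustrate what "filling in the sufficient statistics" means, here is a small sketch of EM (our own, with simulated data) for a bivariate normal sample in which Y is missing for some cases: the E step fills in E[Y|x] and E[Y²|x], and the M step re-estimates the moments. For this monotone pattern EM converges to the same answer as Anderson's closed form above.

```python
# EM for the bivariate normal with X complete and Y partially missing.
import numpy as np

rng = np.random.default_rng(5)
n, n_a = 200, 80
x = rng.normal(125, 25, size=n)
y = 125 + 0.6 * (x - 125) + rng.normal(0, 20, size=n)
miss = np.arange(n) >= n_a                        # Part B: Y treated as missing

# Start from crude estimates based on the observed values.
mu = np.array([x.mean(), y[~miss].mean()])
c = np.cov(x[~miss], y[~miss], ddof=0)
cov = np.array([[x.var(), c[0, 1]], [c[0, 1], c[1, 1]]])

for _ in range(100):
    # E step: expected sufficient statistics for the missing Ys.
    beta = cov[0, 1] / cov[0, 0]
    resid = cov[1, 1] - beta * cov[0, 1]          # residual variance of Y|X
    ey = np.where(miss, mu[1] + beta * (x - mu[0]), y)
    ey2 = np.where(miss, ey ** 2 + resid, y ** 2)
    # M step: re-estimate means, variances, covariance from filled-in sums.
    mu = np.array([x.mean(), ey.mean()])
    cov[0, 0] = x.var()
    cov[0, 1] = cov[1, 0] = np.mean(x * ey) - mu[0] * mu[1]
    cov[1, 1] = ey2.mean() - mu[1] ** 2

print(mu, cov)
```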

Little and Rubin (1987) catalogued EM algorithms for many missing-data problems. EM has also been applied to many situations that are not necessarily thought of as missing-data problems but can be formulated as such: multilevel linear models for unbalanced repeated measures data, where not all participants are measured at all time points (Jennrich & Schluchter, 1986; Laird & Ware, 1982); latent class analysis (Clogg & Goodman, 1984) and other finite-mixture models (Titterington, Smith, & Makov, 1985); and factor analysis (Rubin & Thayer, 1983). For some of these problems, non-EM methods are also available. Newton–Raphson and Fisher scoring are now considered by many to be the preferred methods for fitting multilevel linear models (Lindstrom & Bates, 1988). However, in certain classes of models—finite mixtures, for example—EM is still the method of choice (McLachlan & Peel, 2000).

Software for ML Estimation in Missing-Data Problems

An EM algorithm for ML estimation of an unstructured covariance matrix is available in several programs. The first commercial implementation was released by BMDP (BMDP Statistical Software, 1992), now incorporated into the missing-data module of SPSS (Version 10.0). The procedure is also found in EMCOV (Graham & Hofer, 1991), NORM, SAS (Y. C. Yuan, 2000), Amelia (King et al., 2001), S-PLUS (Schimert, Schafer, Hesterberg, Fraley, & Clarkson, 2001), LISREL (Jöreskog & Sörbom, 2001), and Mplus (L. K. Muthén & Muthén, 1998).

ML is also available for normal models with structured covariance matrices. Multilevel linear models can be fit with HLM (Bryk, Raudenbush, & Congdon, 1996), MLWin (Multilevel Models Project, 1996), the SAS procedure PROC MIXED (Littell, Milliken, Stroup, & Wolfinger, 1996), Stata (Stata, 2001), and the lme function in S-PLUS (Insightful, 2001). Any of these may be used for repeated measures data. In some cases, the documentation and accompanying literature do not mention missing values specifically but describe “unbalanced” data sets, in which participants are not measured at a common set of time points. We must emphasize that if the imbalance occurs not by design but as a result of uncontrolled nonresponse (e.g., attrition), all of these programs will assume MAR. ML estimates for structural equation models with incomplete data are available in Mx (Neale, Boker, Xie, & Maes, 1999), AMOS (Arbuckle & Wothke, 1999), LISREL, and Mplus, which also assume MAR. These programs provide standard errors based on expected or observed information. If offered a choice, the user should opt for observed rather than expected, because the latter is appropriate only under MCAR. The producers of EQS (Bentler, in press) have also announced plans for a new version with missing-data capabilities, but as of this writing it has not yet been released.

Latent class analysis (LCA) is a missing-data problem in the sense that the latent classification is missing for all participants. A variety of software packages for LCA are available; one of the most popular is LEM (Vermunt, 1997). Latent transition analysis (LTA), an extension of LCA to longitudinal studies with a modest number of time points, is available in WinLTA (Collins, Flaherty, Hyatt, & Schafer, 1999). The EM algorithm in the most recent version of LTA allows missing values to occur on the manifest variables in an arbitrary pattern. ML estimates for a wider class of latent-variable models with incomplete data, including finite mixtures and models with categorical responses, are available in Mplus.

Example

To illustrate the properties of likelihood methods, we conducted another simulation using the blood pressure example. One thousand samples of size N = 50 were generated, and missing values were imposed on each sample by the MCAR, MAR, and MNAR methods. From each incomplete data set, ML estimates were computed by the technique of Anderson (1957). Standard errors were obtained by inverting the observed information matrix, and approximate 95% confidence intervals for each parameter were computed by the normal approximation (estimate ± 1.96 SEs).

Results from this simulation are summarized in Table 4. The top panel of Table 4 reports the average and RMSE of the estimates, and the bottom panel reports the coverage and average width of the intervals. Once again, estimates whose bias exceeds one half of a standard error and coverage values below 90% are displayed in boldface. Examining the top panel, we find that the ML estimates are not substantially biased under MCAR or MAR but are quite biased under MNAR. This agrees with the theoretical result that likelihood inferences are appropriate under any MAR situation. In the bottom panel, however, notice that the coverage for some intervals under MAR is quite poor. The reason for this is that a sample of N = 50 with a high rate of missing values is not large enough for the normal approximation to work well. Another simulation was performed with the sample size increased to 250. Results from that simulation, which we omit, show that the intervals for all parameters achieve coverage near 95% under MCAR and MAR but have seriously low coverage under MNAR.

General Comments on ML for Missing-Data Problems

In theory, likelihood methods are more attractive than ad hoc techniques of case deletion and single imputation. However, they still rest on a few crucial assumptions. First, they assume that the sample is large enough for the ML estimates to be approximately unbiased and normally distributed. In missing-data problems the sample may have to be larger than usual, because missing values effectively reduce the sample size. Second, the likelihood function comes from an assumed parametric model for the complete data $P(Y_{obs}, Y_{mis}; \theta)$. Depending on the particular application, likelihood methods may or may not be robust to departures from model assumptions. Sometimes (e.g., in structural equation models) departures might not have a serious effect on estimates but could cause standard errors and test statistics to be very misleading (Satorra & Bentler, 1994). If one dispenses with the full parametric model, estimation procedures with incomplete data are still possible, but they typically require the missing values to be MCAR rather than MAR (K. H. Yuan & Bentler, 2000; Zeger, Liang, & Albert, 1988). For an evaluation of these new procedures for structural equation models, see the recent article by Enders (2001).

Finally, the likelihood methods described in this section assume MAR. When missingness is not controlled by the researcher, it is unlikely that MAR is precisely satisfied. In many realistic applications, however, we believe that departures from MAR are not large enough to effectively invalidate the results of an MAR-based analysis (Collins et al., 2001). When the reasons for missingness seem strongly related to the data, one can formulate a likelihood or Bayesian solution to take this into account. However, these methods—which are reviewed in the last section of this article—are not a panacea, because they still rest on unverifiable assumptions and may be sensitive to departures from the assumed model.

Table 4
Performance of Maximum Likelihood for Parameter Estimates and Confidence Intervals Over 1,000 Samples (N = 50 Participants)

Average parameter estimate (with RMSE in parentheses)

Parameter         MCAR            MAR             MNAR
μY = 125.0    124.8 (6.52)    125.2 (16.9)    151.6 (26.9)
σY = 25.0      24.2 (5.73)     25.5 (7.45)     12.3 (13.2)
ρ = .60         .61 (.19)       .52 (.38)       .39 (.36)
βY|X = .60      .61 (.27)       .60 (.51)       .21 (.43)
βX|Y = .60      .63 (.23)       .49 (.38)       .79 (.68)

Coverage (with average interval width in parentheses)

Parameter         MCAR            MAR             MNAR
μY             91.2 (22.2)     91.6 (58.2)      0.9 (16.4)
σY             86.1 (17.8)     90.2 (28.6)      7.4 (9.94)
ρ              84.2 (0.65)     76.7 (1.20)     89.2 (0.99)
βY|X           90.3 (0.88)     90.7 (1.78)     28.2 (0.59)
βX|Y           91.6 (0.80)     93.0 (1.26)     80.5 (2.18)

Note. Parameters: μ is the population mean; σ is the population standard deviation; ρ is the population correlation; β is the population regression slope. Coverage represents the percentage of confidence intervals that include the parameter value; values near 95 represent adequate coverage. Boldface in the top panel indicates problematic levels of bias (i.e., bias whose absolute size is greater than about one half of the estimate’s standard error); boldface in the bottom panel indicates seriously low levels of coverage (i.e., coverage that falls below 90%, which corresponds to a doubling of the nominal rate of error). MCAR = missing completely at random; MAR = missing at random; MNAR = missing not at random; RMSE = root-mean-square error.

Multiple Imputation

MI, proposed by Rubin (1987), has emerged as a flexible alternative to likelihood methods for a wide variety of missing-data problems. MI retains much of the attractiveness of single imputation from a conditional distribution but solves the problem of understating uncertainty. In MI, each missing value is replaced by a list of m > 1 simulated values as shown in Figure 4. Substituting the jth element of each list for the corresponding missing value, j = 1, . . . , m, produces m plausible alternative versions of the complete data. Each of the m data sets is analyzed in the same fashion by a complete-data method. The results, which may vary, are then combined by simple arithmetic to obtain overall estimates and standard errors that reflect missing-data uncertainty as well as finite-sample variation. Reviews of MI have been published by Rubin (1996) and Schafer (1997, 1999a). Graham, Cumsille, and Elek-Fisk (in press) and Sinharay, Stern, and Russell (2001) have provided less technical presentations for researchers in psychology.

MI has many attractive features. Like single imputation, it allows the analyst to proceed with familiar complete-data techniques and software. One good set of m imputations may effectively solve the missing-data problems in many analyses; one does not necessarily need to re-impute for every new analysis. Unlike other Monte Carlo methods, with MI we do not need a large number of repetitions for precise estimates. Rubin (1987) showed that the efficiency of an estimate based on m imputations, relative to one based on an infinite number, is $(1 + \lambda/m)^{-1}$, where λ is the rate of missing information.² For example, with 50% missing information, m = 10 imputations is 100/(1 + .05) = 95% efficient; additional imputations do little to remove noise from the estimate itself. In some cases, researchers also like to remove noise from other statistical summaries (e.g., significance levels or probability values); in many practical applications, we have found that m = 20 imputations can effectively do this. Once a procedure for managing multiple versions of the data has been established, the additional time and effort required to handle m = 20 versions rather than m = 10 is often of little consequence.

Rubin’s Rules for Combining Estimates and Standard Errors

The simplest method for combining the results of m analyses is Rubin’s (1987) method for a scalar (one-dimensional) parameter. Suppose that Q represents a population quantity (e.g., a regression coefficient) to be estimated. Let $\hat{Q}$ and $\sqrt{U}$ denote the estimate of Q and the standard error that one would use if no data were missing. The method assumes that the sample is large enough so that $U^{-1/2}(\hat{Q} - Q)$ has approximately a standard normal distribution, so that $\hat{Q} \pm 1.96\sqrt{U}$ has about 95% coverage. Of course, we cannot compute $\hat{Q}$ and U; rather, we have m different versions of them, $[\hat{Q}^{(j)}, U^{(j)}]$, $j = 1, \ldots, m$. Rubin’s (1987) overall estimate is simply the average of the m estimates,

$$\bar{Q} = m^{-1} \sum_{j=1}^{m} \hat{Q}^{(j)}.$$

The uncertainty in $\bar{Q}$ has two parts: the average within-imputation variance,

$$\bar{U} = m^{-1} \sum_{j=1}^{m} U^{(j)},$$

and the between-imputations variance,

$$B = (m - 1)^{-1} \sum_{j=1}^{m} \left(\hat{Q}^{(j)} - \bar{Q}\right)^2.$$

The total variance is a modified sum of the two components,

$$T = \bar{U} + (1 + m^{-1}) B,$$

and the square root of T is the overall standard error. For confidence intervals and tests, Rubin (1987) recommended the use of a Student’s t approximation $T^{-1/2}(\bar{Q} - Q) \sim t_\nu$, where the degrees of freedom are given by

$$\nu = (m - 1)\left[1 + \frac{\bar{U}}{(1 + m^{-1})B}\right]^2.$$

The degrees of freedom may vary from m − 1 to ∞ depending on the rate of missing information. When the degrees of freedom are large, the t distribution is essentially normal, the total variance is well estimated, and there is little to be gained by increasing m. The estimated rate of missing information for Q is approximately $r/(r + 1)$, where $r = (1 + m^{-1})B/\bar{U}$ is the relative increase in variance due to nonresponse. Additional methods for combining multidimensional parameter estimates, likelihood-ratio test statistics, and probability values from hypothesis tests were reviewed by Schafer (1997, chapter 4).

² The rate of missing information, as distinct from the rate of missing observations, measures the increase in the large-sample variance of a parameter estimate (Equation 6) due to missing values. It may be greater or smaller than the rate of missing values in any given problem.

Figure 4. Schematic representation of multiple imputation, where m is the number of imputations.
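These rules translate directly into a small routine; the numbers passed in below are illustrative, not taken from any analysis in this article.

```python
# Rubin's (1987) rules for a scalar parameter: combine m complete-data
# estimates Q^(j) and squared standard errors U^(j) into an overall
# estimate, standard error, and t-based confidence interval (B > 0 assumed).
import numpy as np
from scipy.stats import t as t_dist

def rubin_combine(q, u, level=0.95):
    q, u = np.asarray(q), np.asarray(u)
    m = len(q)
    q_bar = q.mean()                              # overall estimate
    u_bar = u.mean()                              # within-imputation variance
    b = q.var(ddof=1)                             # between-imputations variance
    t_var = u_bar + (1 + 1 / m) * b               # total variance
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    half = t_dist.ppf(0.5 + level / 2, df) * np.sqrt(t_var)
    return q_bar, np.sqrt(t_var), (q_bar - half, q_bar + half), df

est, se, ci, df = rubin_combine([0.58, 0.61, 0.55, 0.64, 0.60],
                                [0.02, 0.02, 0.03, 0.02, 0.02])
print(est, se, ci, df)
```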

Proper MI

The validity of MI rests on how the imputations are created and how that procedure relates to the model used to subsequently analyze the data. Creating MIs often requires special algorithms (Schafer, 1997). In general, they should be drawn from a distribution for the missing data that reflects uncertainty about the parameters of the data model. Recall that with single imputation, it is desirable to impute from the conditional distribution $P(Y_{mis} \mid Y_{obs}; \hat{\theta})$, where $\hat{\theta}$ is an estimate derived from the observed data. MI extends this by first simulating m independent plausible values for the parameters, $\theta^{(1)}, \ldots, \theta^{(m)}$, and then drawing the missing data $Y_{mis}^{(t)}$ from $P(Y_{mis} \mid Y_{obs}; \theta^{(t)})$ for $t = 1, \ldots, m$.

Treating parameters as random rather than fixed is an essential part of MI. For this reason, it is natural (but not essential) to motivate MI from the Bayesian perspective, in which the state of knowledge about parameters is represented through a posterior distribution. Bayesian methods have become increasingly popular in recent years; a good modern overview was provided by Gelman et al. (1995). For our purposes, it is not necessary to delve into the details of the Bayesian perspective to explain how MI works; we simply note a few principles of Bayesian analysis.

First, in a Bayesian analysis, all of the data’s evidence about parameters is summarized with a likelihood function. As with ML, the assumed parametric form of the model may be crucial; if the model is inaccurate, then the posterior distribution may provide an unrealistic view of the state of knowledge about θ. Second, Bayesian analysis requires a prior distribution for the unknown parameters. Critics may regard the use of a prior distribution as subjective, artificial, or unscientific. We tend to regard it as a “necessary evil.” In some problems, prior distributions can be formulated to reflect a state of relative ignorance about the parameters, mitigating the effect of the subjective inputs. Finally, in a Bayesian analysis, the influence of the prior diminishes as the sample size increases. Indeed, one often finds that a range of plausible alternative priors leads to similar posterior distributions. Because MI already relies on large-sample approximations for the complete-data distribution, the prior rarely exerts a major influence on the results.

Creating MIs Under a Normal Model

To understand what is happening within an MI algorithm, consider a hypothetical data set with three variables Y1, Y2, and Y3, which we assume to be jointly normally distributed. Suppose that one group of participants (Group A) has measurements for all three variables, another group (Group B) has measurements for Y1 and Y2 but missing values for Y3, and a third group (Group C) has measurements for Y3 but missing values for Y1 and Y2. The parameters of the trivariate normal model—three means, three variances, and three correlations—are not known and should be estimated from all three groups. If the parameters were known, MIs could be drawn in the following way. Group A requires no imputation. For Group B, we would need to compute the linear regression of Y3 on Y1 and Y2. Then, for each participant in Group B, we would use his or her own values of Y1 and Y2 to predict the unknown value of Y3 and impute the predicted value $\hat{Y}_3$ plus random noise drawn from a normal distribution with the appropriate residual variance. For Group C, we would compute the bivariate regression of Y1 and Y2 on Y3, obtain the joint prediction $(\hat{Y}_1, \hat{Y}_2)$ for each participant, and add random noise drawn from a bivariate normal distribution with the appropriate residual variances and covariance.
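As a concrete sketch of the Group B step, the code below derives the conditional distribution of Y3 given (Y1, Y2) from a mean vector and covariance matrix and imputes predicted values plus residual noise. The parameter values are fixed, illustrative placeholders; in proper MI they would be drawn from a posterior distribution, as described next.

```python
# One imputation for Group B (Y3 missing; Y1, Y2 observed) under a
# trivariate normal model with given parameters (mu, sigma).
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([0.0, 0.0, 0.0])
sigma = np.array([[1.0, 0.5, 0.4],
                  [0.5, 1.0, 0.3],
                  [0.4, 0.3, 1.0]])

y12 = rng.multivariate_normal(mu[:2], sigma[:2, :2], size=5)  # Group B data

# Regression of Y3 on (Y1, Y2) implied by (mu, sigma).
coef = np.linalg.solve(sigma[:2, :2], sigma[:2, 2])
resid_var = sigma[2, 2] - sigma[:2, 2] @ coef

pred = mu[2] + (y12 - mu[:2]) @ coef
y3_imp = pred + rng.normal(scale=np.sqrt(resid_var), size=len(pred))
print(y3_imp)   # predicted values plus appropriate residual noise
```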

A crucial feature of MI is that the missing values for each participant are predicted from his or her own observed values, with random noise added to preserve a correct amount of variability in the imputed data. Another feature is that the joint relationships among the variables Y1, Y2, and Y3 must be estimated from all available data in Groups A, B, and C. ML estimates of the parameters could be computed using an EM algorithm, but proper MI requires that we reflect uncertainty about these parameters from one imputation to the next. Therefore, instead of using ML estimates, we need to draw random values of the parameters from a posterior distribution based on the observed-data likelihood and a prior. The form of this posterior distribution is not easy to describe, but it can be sampled from by a variety of techniques; data augmentation (Schafer, 1997) is straightforward, but one could also use importance resampling (King et al., 2001). The effect of drawing parameters from a posterior distribution, rather than using ML estimates, is that for Group B the regression of Y3 on Y1 and Y2 will be randomly perturbed from one set of imputations to the next; similarly, in Group C the joint regression of (Y1, Y2) on Y3 will also be perturbed.

A review of MI computations is beyond the scope of this article. Rubin (1987) described how to create MIs for some univariate and monotone situations shown in Figure 1, a and b. Algorithms for multivariate data with arbitrary patterns were given by Schafer (1997), along with detailed guidance and data examples. Proper application of these procedures requires some understanding of the properties of data augmentation, especially its convergence behavior. A gentle introduction and tutorial on the use of MI under a multivariate normal model was provided by Schafer and Olsen (1998; also see Graham et al., in press; Sinharay et al., 2001).

Choosing the Imputation Model

Notice that the MI procedure described above is based on a joint normality assumption for Y1, Y2, and Y3. This model makes no distinctions between response (dependent) or predictor (independent) variables but treats all three as a multivariate response. The imputation model is not intended to provide a parsimonious description of the data, nor does it represent structural or causal relationships among variables. The model is merely a device to preserve important features of the joint distribution (means, variances, and correlations) in the imputed values. A procedure that preserves the joint distribution of Y1, Y2, and Y3 will automatically preserve the linear regression of any of these variables on the others. Therefore, in a subsequent analysis of the imputed data, any variable could be treated as a response or as a predictor. For example, we may regress Y2 on Y1 in each imputed data set and combine the estimated intercepts and slopes by Rubin’s (1987) rules. Distinctions between dependent and independent variables and substantive interpretation of relationships should be left to postimputation analyses.

Although it is not necessary to have a scientific theory underlying an imputation model, it is crucial for that model to be general enough to preserve effects of interest in later analyses. Suppose that we will examine differences in mean response between an experimental and control group. For differences to be preserved in the imputed values, some indicator of group membership should enter the imputation model. For example, a dummy variable (0 = control, 1 = experimental) could be included in the normal model, which will preserve the main effect of group membership on any other variable. To preserve interaction between group membership and other variables, one could split the data set and apply a separate imputation model to each group.

Real data rarely conform to normality. Some might hesitate to use a normal imputation model, fearing that it may distort the distributions of nonnormal variables. Many tricks are available to help preserve distributional shape (Schafer, 1997). Binary or ordinal variables may be imputed under a normality assumption and then rounded off to discrete values. If a variable is right skewed, it may be modeled on a logarithmic (or some other power-transformed) scale and transformed back to the original scale after imputation. Transformations and rounding help to make the imputed values aesthetically appealing; for example, a log transformation for a positive variable will guarantee that the imputed values are always positive. Depending on the analysis applied to the imputed data, however, these procedures may be superfluous. Graham and Schafer (1999) presented a simulation in which highly nonnormal variables were imputed under normality assumptions with no transformations or rounding and reported excellent performance for linear regression even with relatively small samples. Although joint normality is rarely realistic, we have found the model to be useful in a surprisingly wide variety of problems. Analyses involving normal-based MI with real data have been published by Graham and Hofer (2000), Graham et al. (in press), and others.

Of course, there are situations in which the normal model should be avoided—to impute variables that are nominal (unordered categories), for example. Schafer (1997) presented imputation methods for multivariate categorical data and for data sets containing both continuous and categorical variables. With these models, one can specify and preserve higher order associations among the variables, provided that the data are rich enough to estimate these associations. These models assume that the rows or observational units in the data set have been independently sampled and thus do not automatically take into account longitudinal or clustered structure. Imputation models specifically designed for longitudinal and clustered data have been described by Liu, Taylor, and Belin (2000) and Schafer (2001).

For a small class of problems, it is possible to create MIs without a data model. Rubin (1987) described a method called the approximate Bayesian bootstrap (ABB), which involves two cycles of hot-deck imputation. Rubin’s ABB applies to univariate missing data without covariates. Lavori, Dawson, and Shera (1995) generalized the ABB to include fully observed covariates as in Figure 1a. This method has been implemented in a program called SOLAS (Statistical Solutions, 1998), where it is called the propensity-score option. This procedure was designed to provide unbiased estimates of the distribution of a single outcome, not to preserve the relationship between the outcome and other items. Imputations created by this method may seriously distort covariance structure (Allison, 2000). An alternative procedure in SOLAS, called the model-based method, is described below; it is more appropriate for situations in which postimputation analyses will involve covariances and correlations.

MI Software

Many computer programs for MI are now available. NORM, a free program for Windows, creates MIs for incomplete data with arbitrary patterns of missing values under an unstructured normal model. Algorithms used in NORM were described by Schafer (1997). NORM includes utilities for automatic pre- and postimputation transformations and rounding of imputed values. A new SAS procedure, PROC MI (Y. C. Yuan, 2000), brings the method used by NORM into the popular SAS environment. Other versions of the same algorithm have also been implemented in a newly released missing-data library in S-PLUS and in LISREL. Amelia, a free program created by King et al. (2001), relies on the normal model but uses a different computational technique.

With any of the programs mentioned above, a sufficient number of cases must be available to estimate an unstructured covariance matrix for all variables. This may be problematic. For example, in psychological research it is not uncommon to collect about p = 100 or more items for only N = 100 subjects, with the intention of later averaging the items into a few scales or subscales. Without imposing additional prior information or structure, we cannot fit a normal model to all items at once. Song (1999) simplified the covariance structure by assuming a few common factors, generating MIs under this restricted model; however, a software implementation of this new method is not yet available. Longitudinal structure arising from repeated measurements over time has been implemented in the S-PLUS function PAN (Schafer, 2001; Schafer & Yucel, in press).

For nonnormal imputation models, the software choices are more limited. Methods described by Schafer (1997) for multivariate categorical data, and for mixed data sets containing both continuous and categorical variables, are commercially available in the new S-PLUS missing-data module; this supersedes the older CAT and MIX libraries for S-PLUS written by Schafer. The propensity score method of SOLAS does not assume a parametric model for the data but, as previously mentioned, can seriously distort intervariable relationships and is potentially dangerous.

Rather than describing the joint distribution of all variables by a multivariate model, some prefer to implement MI with a sequence of single-response regressions. With the monotone pattern of Figure 1b, the information about missing values can indeed be captured by a sequence of regressions for $Y_j$ given $Y_1, \ldots, Y_{j-1}$ for $j = 1, \ldots, p$. The so-called model-based method in SOLAS (Version 2.0 or later) creates proper MIs for univariate or monotone patterns by such a regression sequence. The method can also be applied to nonmonotone patterns, but cases that do not conform to the monotone pattern are ignored in the model fitting. For data sets that deviate substantially from monotonicity, the joint modeling procedures used in S-PLUS seem preferable.

In a large survey application, Kennickell (1991) performed approximate MI for nonmonotone data by specifying a single-response regression model for each variable given the others and repeatedly fitting the models in an iterated sequence. A general implementation of this method using SAS macros is found in the IVEware library (Raghunathan, Solenberger, & Van Hoewyk, 2000). A similar library for S-PLUS, called MICE, was written by Van Buuren and Oudshoorn (1999). From a theoretical standpoint, this technique is problematic, because the sequence of regression models might not be consistent with a true joint distribution. Technically speaking, these iterative algorithms may never “converge” because the joint distribution to which they may converge does not exist. Nevertheless, simulation work (Brand, 1999) suggests that in some practical applications the method can indeed work well despite the theoretical problems.
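A toy version of this iterated sequential-regression idea, for two incomplete variables, is sketched below. It is our own illustration, not the IVEware or MICE implementation, and it adds residual noise at every cycle as the method requires.

```python
# Chained-equations-style imputation: repeatedly impute each variable from
# a regression on the other, with residual noise, over a few cycles.
import numpy as np

rng = np.random.default_rng(7)
n = 300
z = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=n)
y1, y2 = z[:, 0].copy(), z[:, 1].copy()
m1 = rng.uniform(size=n) < 0.2; y1[m1] = np.nan    # nonmonotone missingness
m2 = rng.uniform(size=n) < 0.2; y2[m2] = np.nan

def draw(target, predictor, miss):
    # Regress on cases where the target was actually observed.
    X = np.column_stack([np.ones((~miss).sum()), predictor[~miss]])
    b, *_ = np.linalg.lstsq(X, target[~miss], rcond=None)
    s = np.sqrt(np.mean((target[~miss] - X @ b) ** 2))
    out = target.copy()
    out[miss] = b[0] + b[1] * predictor[miss] + rng.normal(scale=s, size=miss.sum())
    return out

y1[m1] = np.nanmean(y1); y2[m2] = np.nanmean(y2)   # crude starting fill
for _ in range(10):                                 # a few cycles usually suffice
    y1 = draw(y1, y2, m1)
    y2 = draw(y2, y1, m2)
print(np.corrcoef(y1, y2)[0, 1])                    # roughly preserves the .6
```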

Some utilities are also available to simplify the task of analyzing imputed data sets. NORM combines parameter estimates and standard errors from multiple analyses using Rubin’s (1987) rules; it can also combine groups of coefficients and their covariance matrices for multiparameter inference. A new SAS procedure called PROC MIANALYZE does essentially the same thing. The new S-PLUS missing-data library includes the functions miApply and miEval, which automatically carry out a data analysis procedure (e.g., fitting a regression model) for all the imputed data sets, storing the results together in a convenient format; four additional functions are available to consolidate the results.

This list of software is not exhaustive and may be outdated when this article appears in print. For up-to-date information on MI software, readers should refer to on-line resources; a good starting point is the Web site at http://www.multiple-imputation.com. Some of the programs listed above were recently reviewed by Horton and Lipsitz (2001).

Example

Returning to the blood pressure example, we simulated one thousand samples of N = 50 participants and imposed missing values by the MCAR, MAR, and MNAR methods. Using NORM, we multiply imputed the missing values for each incomplete data set 20 times. We chose m = 20 because of the unusually high rate of missing values in this example (nearly 80%); with more moderate rates fewer would suffice. After imputation, we computed estimates and information-based standard errors for the five parameters from each imputed data set, then combined the results across each set of m = 20 imputations by Rubin’s (1987) rules.

The results are shown in Table 5. Comparing the top panel of Table 5 with the likelihood-based results in the top panel of Table 4, we see that MI and ML estimates are very similar. NORM’s imputation method is theoretically appropriate for MAR missingness, but some bias is evident under the MAR condition because of the small sample size; if the sample size is increased, the bias disappears. The behavior of the MI intervals reported in the bottom panel of Table 5, however, shows some improvement over the likelihood-based intervals in the bottom panel of Table 4; in most cases the coverage is closer to 95%. Good performance of MI intervals in small samples has been previously noted (Graham & Schafer, 1999). Increasing the sample size to 250, we found that the results from MI became nearly indistinguishable from the ML results with the same sample size.

Table 5
Performance of Multiple Imputation (m = 20) Under a Normal Model for Parameter Estimates and Confidence Intervals Over 1,000 Samples (N = 50 Participants)

Average parameter estimate (with RMSE in parentheses)

Parameter         MCAR            MAR             MNAR
μY = 125.0    124.9 (6.53)    125.3 (17.2)    151.6 (26.9)
σY = 25.0      25.9 (5.93)     28.7 (8.24)     13.6 (12.1)
ρ = .60         .57 (.19)       .45 (.37)       .35 (.36)
βY|X = .60      .61 (.27)       .59 (.52)       .21 (.43)
βX|Y = .60      .56 (.22)       .39 (.38)       .66 (.56)

Coverage (with average interval width in parentheses)

Parameter         MCAR            MAR             MNAR
μY             93.5 (26.1)     94.5 (71.5)      1.8 (19.9)
σY             93.2 (22.1)     96.1 (35.1)     18.0 (12.8)
ρ              90.8 (0.75)     86.9 (1.35)     90.4 (1.06)
βY|X           93.6 (1.05)     94.5 (2.18)     39.4 (0.71)
βX|Y           94.9 (0.87)     95.0 (1.29)     91.4 (2.18)

Note. m is the number of imputations. Parameters: μ is the population mean; σ is the population standard deviation; ρ is the population correlation; β is the population regression slope. Coverage represents the percentage of confidence intervals that include the parameter value; values near 95 represent adequate coverage. Boldface in the top panel indicates problematic levels of bias (i.e., bias whose absolute size is greater than about one half of the estimate’s standard error); boldface in the bottom panel indicates seriously low levels of coverage (i.e., coverage that falls below 90%, which corresponds to a doubling of the nominal rate of error). MCAR = missing completely at random; MAR = missing at random; MNAR = missing not at random; RMSE = root-mean-square error.

General Comments on MI

MI is a relative newcomer, but its theoretical properties are fairly well understood. Because MI relies on Bayesian arguments, its performance is similar to that of a likelihood method that makes similar assumptions. Like ML, MI does rely on large-sample approximations, but some limited experience (e.g., the simulation reported above) suggests that with small to moderate sample sizes these approximations may work better for MI than they do for ML. Of course, MI also requires assumptions about the distribution of missingness. Nearly all MI analyses to date have assumed that the missing data are MAR, but a few MNAR applications have been published (Glynn et al., 1993; Verbeke & Molenberghs, 2000). Nothing in the theory of MI requires us to keep the MAR assumption, and new methods for generating MIs under MNAR models will certainly arrive in the future.

MI uses a model at the imputation phase, so robustness to departures from the model is a concern. In many situations we expect MI to be fairly robust, because the model is effectively applied not to the entire data set but only to the conditional distribution of the missing part. It is also possible to combine a fully parametric MI procedure with postimputation analysis by a robust method, and unless the imputation model is grossly misspecified the performance should be quite good (Meng, 1999). In structural equation modeling, for example, one could multiply impute the missing values under a normality assumption and then fit the structural model to the imputed data using the robust techniques of Satorra and Bentler (1994). Although further study of this is needed, we conjecture that the properties of these hybrid procedures will be excellent, perhaps better than the normality-based likelihood approach of AMOS or Mx.

One important difference between MI and likelihood methods is that with likelihood the missing values are dealt with during the model-fitting procedure, whereas in MI they are dealt with prior to the analysis.

When the same model is used for imputation and analysis, MI produces answers similar to those of a likelihood analysis under that same model. Much of the strength and flexibility (and, perhaps, the danger) of MI, however, stems from the interesting possibility of using different models for imputation and analysis. Differences in these two models do not necessarily invalidate the method but may actually strengthen it. With MI the imputer is free to make use of additional data (e.g., extra variables) that do not appear in the analysis, and if those data are useful for predicting missing values, then MI increases power. Properties of MI when the imputer's and analyst's models differ have been explored theoretically by Meng (1994) and Rubin (1996) and from a practical standpoint by Collins, Schafer, and Kam (2001).
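As a minimal illustration of this point (all variable names and data below are hypothetical), the following sketch imputes a response under a model that includes an auxiliary variable z and then fits the analyst's smaller model of y on x to each completed data set. Refitting the imputation model to a bootstrap sample of the complete cases is one simple way to approximate proper imputation, in which the imputation-model parameters are redrawn for each imputed data set.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
z = rng.normal(size=n)                       # auxiliary variable, fully observed
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + 0.9 * z + rng.normal(size=n)
miss = rng.random(n) < 1 / (1 + np.exp(-z))  # y is MAR given z
y_obs = np.where(miss, np.nan, y)

m = 20
slopes = []
cc = np.flatnonzero(~miss)                   # indices of complete cases
for _ in range(m):
    # Approximate a posterior draw of the imputation model by
    # refitting it to a bootstrap sample of the complete cases.
    boot = rng.choice(cc, size=cc.size, replace=True)
    A = np.column_stack([np.ones(boot.size), x[boot], z[boot]])
    beta, *_ = np.linalg.lstsq(A, y_obs[boot], rcond=None)
    resid_sd = np.std(y_obs[boot] - A @ beta, ddof=3)
    # Impute the missing y's from the richer model (which uses z),
    # then fit the analyst's smaller model of y on x alone.
    y_imp = y_obs.copy()
    A_mis = np.column_stack([np.ones(miss.sum()), x[miss], z[miss]])
    y_imp[miss] = A_mis @ beta + rng.normal(scale=resid_sd, size=miss.sum())
    slopes.append(np.polyfit(x, y_imp, 1)[0])

print(np.mean(slopes))  # pooled slope of y on x, roughly 0.5 in this setup
```

Because z predicts the missing y's, including it in the imputation model recovers information that the analyst's model alone would discard, which is the source of the power gain described above.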

Recent Developments

Methods Based on Weighting

The notion of reducing bias due to non-MCAR missingness by reweighting has a long history in the survey literature (Little & Rubin, 1987, chapter 4). Recently, biostatisticians have begun to apply this idea in regression modeling with incomplete covariates. Robins et al. (1994) developed weighted regression that requires an explicit model for the missingness but relaxes some of the parametric assumptions in the data model. Their method is an extension of generalized estimating equations (GEE), a popular technique for modeling marginal or population-averaged relationships between a response variable and predictors (Zeger et al., 1988). These models are called semiparametric, because they require the regression equation to have a specific form (e.g., linear or log-linear) but beyond that do not specify any particular probability distribution for the response variable itself. The same estimation procedure can be applied whether the response is continuous or discrete. Older GEE methods can accommodate missing values only if they are MCAR; newer methods allow them to be MAR or even MNAR, provided that a model for the missingness is correctly specified. Further results and extensions have been given by Robins and Rotnitzky (1995) and Robins, Rotnitzky, and Scharfstein (1998).
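The weighting idea itself is easy to demonstrate. The sketch below is a simplified illustration of inverse-probability weighting, not the full semiparametric estimator of Robins et al.; all names and data are hypothetical. It fits a logistic model for the probability that a covariate is observed, allowing that probability to depend on the response, and then weights the complete cases by the inverse of the estimated probability.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)            # covariate subject to missingness
y = 1.0 + 0.6 * x1 + 0.8 * x2 + rng.normal(size=n)

# Missingness of x2 depends on x1 and on the response y.
p_obs = 1 / (1 + np.exp(-(1.0 + 0.7 * x1 - 0.7 * y)))
r = rng.random(n) < p_obs                     # True if x2 is observed

# Step 1: model the probability of being a complete case.
design_r = sm.add_constant(np.column_stack([x1, y]))
p_hat = sm.Logit(r.astype(float), design_r).fit(disp=0).predict(design_r)

# Step 2: refit the substantive regression on the complete cases,
# weighting each by 1 / p_hat; unweighted complete-case analysis
# would be biased here because missingness depends on y.
X_cc = sm.add_constant(np.column_stack([x1[r], x2[r]]))
ipw_fit = sm.WLS(y[r], X_cc, weights=1.0 / p_hat[r]).fit()
print(ipw_fit.params)  # approaches (1.0, 0.6, 0.8) as n grows
```

The point estimates are corrected by the weights; note that honest standard errors for such estimators require additional (sandwich-type) corrections beyond this sketch.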

A primary motivation of these weighting methods is to achieve robustness, good performance over more general classes of population distributions. However, extra generality does not come for free. Semiparametric estimators can be less efficient and less powerful than ML or Bayesian estimators under a well-specified parametric model. With missing data, Rubin's (1976) results show that ML or Bayesian methods perform uniformly well over any MAR missingness distribution, and the user does not need to specify that distribution. However, semiparametric methods that relax assumptions about the data must in turn assume a specific form for the distribution of missingness. We acknowledge that these weighting techniques may be useful in some circumstances. However, as a general principle, we also believe that a researcher's time and effort are probably better spent building an intelligent model for the data rather than building a good model for the missingness, especially if departures from MAR are not a serious concern. In one simulated example, Meng (1999) noted that, for these semiparametric methods to gain a substantial advantage over Bayesian MI, the parametric model had to be so grossly misspecified that only the most "statistically challenged" researcher would use it.

Principles of weighting have also been used to compute ML estimates for fully parametric regression models with missing covariates. Ibrahim (1990) developed a weighted estimation method for generalized linear models, a class that encompasses traditional linear regression, logistic regression, and log-linear modeling (McCullagh & Nelder, 1989). Ibrahim's method is an EM algorithm under a generic model for the joint distribution of the predictors. This method requires the predictors to be discrete and does not take relationships among them into account. With a normal response, Ibrahim's algorithm is a special case of EM for mixed continuous and categorical data considered by Little and Rubin (1987) and Schafer (1997). The method has also been used for survival analysis (Schluchter & Jackson, 1989; Lipsitz & Ibrahim, 1996, 1998). Weighting methods for ML regression with missing covariates were reviewed by Horton and Laird (1999). Although formal comparisons have not yet been made, we expect that, in many cases, these weighting techniques produce answers similar to MI under a suitable joint model for the response and covariates.

Methods That Do Not Assume MAR

Many recent publications focus on MNAR missingness. MNAR is a potentially serious concern in clinical trials, in which participants may be dropping out for reasons closely related to the response being measured. For example, Hedeker and Gibbons (1997) described an experiment in which patients were treated for depression. The response variable was a standardized depression score. Patients who were doing well (i.e., experiencing lower levels of depression) appeared to have higher rates of attrition, perhaps because they believed treatment was no longer necessary. Patients who were not improving also appeared to have higher attrition rates, perhaps because they decided to seek alternative treatment. In these situations, it seems useful to allow the probability of dropout at any occasion to depend on the participant's response at that occasion.

Without the MAR assumption, one must explicitly specify a distribution for the missingness in addition to the model for the complete data. There are two fundamentally different ways to do this: selection models and pattern-mixture models.

Selection models. Selection models were first used by econometricians to describe how the probability of response to a sensitive questionnaire item (e.g., personal income) may depend on that item (Amemiya, 1984; Heckman, 1976). In a selection model, we first specify a distribution for the complete data, then propose a manner in which the probability of missingness depends on the data. For example, we could assume that the logarithm of income is normally distributed in the population and that each individual's probability of responding is related to his or her log-income by logistic or probit regression. Mathematically, a selection model builds a joint distribution for the complete data Ycom and the missingness R by specifying a marginal distribution for Ycom = (Yobs, Ymis) and a conditional distribution for R given Ycom,

P(Ycom, R; θ, ξ) = P(Ycom; θ) P(R|Ycom; ξ), (7)

where θ represents unknown parameters of the complete-data population and ξ represents unknown parameters of the conditional distribution of missingness. The likelihood function is obtained by collapsing this joint distribution over the unknown Ymis, as shown in Equation 3.
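As a toy illustration of Equation 7 (hypothetical numbers throughout), one can simulate the log-income example: draw the complete data from a normal model, then generate the response indicators from a logistic regression on the data value itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# P(Ycom; theta): log-income is normal in the population.
log_income = rng.normal(loc=10.0, scale=0.8, size=n)

# P(R | Ycom; xi): the probability of responding declines with
# log-income via a logistic regression on the value itself (MNAR).
p_respond = 1 / (1 + np.exp(-2.0 * (10.0 - log_income)))
responded = rng.random(n) < p_respond

print(log_income.mean())             # population mean, about 10.0
print(log_income[responded].mean())  # respondent mean, well below 10.0
```

The gap between the two means is the bias that an analysis ignoring the selection mechanism would inherit.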

Selection models for longitudinal studies with dropout have been reviewed by Little (1995) and Verbeke and Molenberghs (2000). In typical applications (e.g., Diggle & Kenward, 1994), researchers assume that measurements over time (Y1, . . . , YT) follow a well-recognized distribution such as a multivariate normal and allow the probability of dropout at occasion t to follow a logistic regression on the previous and current responses (Y1, . . . , Yt) but not on future responses. The procedures required to obtain ML estimates are not trivial, because the likelihood (Equation 3) often cannot be computed directly and must be approximated. The likelihood for these models can be oddly shaped, with large, flat regions indicating that parameters are poorly identified.

Selection models for dropout have intuitive appeal. It is natural to think about how a participant's data values may influence his or her probability of dropping out, a notion that corresponds directly to Equation 7. Moreover, the same factorization is used in the definitions of MCAR, MAR, and MNAR. If we alter the dropout model by setting the logistic coefficients for (Y1, . . . , Yt) or Yt to zero, we obtain MCAR and MAR versions as special cases. This raises the interesting possibility of "testing" the MAR hypothesis by seeing whether the confidence interval for the coefficient of Yt includes zero. As pointed out by Kenward (1998), Little and Rubin (1987, chapter 11), and many others (see, e.g., the discussions following Diggle & Kenward, 1994), results of such tests rest heavily on untestable assumptions about the population distribution, and minor changes in the assumed shape of this distribution may drastically alter the conclusions. Many consider these models to be too unstable for scientific applications and to be more useful for raising questions than generating answers (Laird, 1994).

Pattern-mixture models. As an alternative to the selection model, Little (1993) described a class of MNAR methods based on a pattern-mixture formulation. Pattern-mixture models do not describe individuals' propensities to respond. Rather, they classify individuals by their missingness and describe the observed data within each missingness group. A generic pattern-mixture model can be written as

P(Ycom, R; π, δ) = P(R; π) P(Ycom|R; δ), (8)

where π denotes the proportions of the population falling into the various missingness groups and δ represents the parameters of the conditional distributions of the data within groups. Estimation of δ always requires some unverifiable assumptions, because a portion of Ycom is hidden for every group having missing data; Little (1993) called these assumptions identifying restrictions. For a review of pattern-mixture models for longitudinal studies with dropout, see Little (1995) or Verbeke and Molenberghs (2000).

Pattern-mixture models are closely related to multiple-group procedures for missing data in structural equation modeling (Allison, 1987; Duncan & Duncan, 1994; Muthen, Kaplan, & Hollis, 1987). In the multiple-groups approach, observational units are sorted by missingness pattern, and each pattern is assumed to provide information about a subset of the model parameters. A model is fit to each pattern, and parameters from different patterns with equivalent meaning are constrained to be equal. Because of the particular form of the constraints used in these published articles, the estimation procedures yielded ML parameter estimates appropriate under MAR. Other types of constraints that produce MNAR models are possible; for a simple example, see Little (1994).

By nature, pattern-mixture models do not posit strong theories about the mechanisms of missingness; rather, they describe the observed responses in each missingness group and extrapolate aspects of this behavior to unseen portions of the data. The likelihood function for a pattern-mixture model, obtained by collapsing Equation 8 over the missing data Ymis, tends to be more convenient to maximize than the likelihood for a selection model. In some classes of pattern-mixture models (the so-called random coefficients models used by Hedeker and Gibbons, 1997), ML estimates can be computed by conventional longitudinal modeling software. One inconvenient feature of these models, however, is that the parameters appearing in the formulation (Equation 8) are rarely the parameters of scientific interest. In most cases, we want to describe some aspect of the distribution of Ycom (e.g., a treatment effect) in the population of all missingness groups combined. To estimate those parameters, one usually needs to compute a weighted average of group-specific estimates, with weights determined by the relative sizes of the groups in the sample. Alternatively, one may carry out this averaging through MI.
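In code, the final averaging step is simple. The sketch below (hypothetical estimates and proportions) combines group-specific estimates of a mean outcome, already obtained under some set of identifying restrictions, into an overall estimate weighted by the observed pattern proportions.

```python
import numpy as np

# Hypothetical estimates of a mean outcome within each missingness
# pattern, computed under chosen identifying restrictions.
group_estimates = np.array([4.2, 3.1])  # e.g., completers, dropouts
group_props = np.array([0.7, 0.3])      # observed pattern proportions

# Marginalizing Equation 8 over R: weight each group's estimate by
# its share of the population.
overall = float(np.sum(group_props * group_estimates))
print(overall)  # 3.87
```

The standard error of the overall estimate must also account for sampling variability in the pattern proportions, which is one reason MI is often a convenient vehicle for this averaging.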

Pattern-mixture models may not suffer from the extreme sensitivity to distributional shape exhibited by selection models, but their assumptions are no less strong. Estimation of population effects is possible only through identifying restrictions, and the observed data provide no evidence whatsoever to support or contradict these assumptions. Proponents of pattern-mixture models (e.g., Little, 1993) have suggested using these methods for sensitivity analysis, varying the identifying restrictions to see how the results change. Detailed examples of pattern-mixture modeling were given by Verbeke and Molenberghs (2000).

Discussion of MNAR methods. MNAR modeling seems worthwhile for clinical studies in which reasons for dropout may be closely related to the outcomes being measured. In other situations, psychologists should perhaps resist the urge to apply these methods routinely. Simulations by Collins et al. (2001) show that when the true cause of missingness is the response variable itself, failure to account for the MNAR aspect may indeed introduce a sizable bias into parameter estimates. In other situations, in which the true cause is not the response but an unmeasured variable that is only moderately correlated with the response (with, say, a correlation of .40 or less), failure to account for the cause seems capable of introducing only minor bias. For many social science applications, we suspect that the former is the exception and the latter is the rule. For example, Graham et al. (1997) described a longitudinal study of drug and alcohol use among middle and high school students. Attrition greatly reduced the sample size over time. Common notions about substance use might cast doubt on results of an MAR-based analysis, because users could be dropping out of school at higher rates than nonusers. Subsequent debriefing of the data collectors revealed, however, that in most cases attrition could be explained by students' moving away or transferring to other schools that did not participate in the study. It was relatively infrequent that dropout could be plausibly related to substance use (Graham et al., 1997). Even if one argues that mobility and substance use are related, it stretches the imagination to believe that the correlation between them could be much greater than .40. Therefore, we are inclined to trust the results of an MAR-based analysis in this example.

When data are collected by self-report questionnaire, it is again natural to speculate whether missingness is caused by the phenomena being measured, especially if the items are of a personal or sensitive nature. If an item pertains to nonnormative behavior, some participants exhibiting that behavior may indeed leave it blank in order to mask their true values, despite repeated assurances of confidentiality. On the other hand, some who do not exhibit that behavior may also skip the item, thinking that the question cannot possibly apply to them. Reasons for nonresponse vary from one person to another. Many of these reasons could be correlated with the item itself, but probably not to the same degree, and perhaps not even in the same direction. Are we to believe that the best aggregate measure of "cause" is correlated with our item to a degree of .40 or more, even after accounting for its relationship to other observed covariates? Some methodologists may be afraid to answer this question, choosing instead to leave it blank. Perhaps we are too bold, but in most cases we are inclined to say, "No."

If MNAR attrition is anticipated, researchers may be able to mitigate its effects by simple changes in the study design. For example, at each occasion of measurement, we could ask each participant, "How likely are you to drop out of this study before the next session?" Collecting this additional covariate and including it in analyses may effectively convert an MNAR situation to MAR.

Concluding Remarks

Although other procedures are occasionally useful, we recommend that researchers apply likelihood-based procedures, where available, or the parametric MI methods described in this article, which are appropriate under general MAR conditions. As MNAR methods are incorporated into mainstream software, they, too, will become attractive in certain circumstances, particularly for sensitivity analysis. Until then, ML and MI under the MAR assumption represent the practical state of the art.

References

Allison, P. D. (1987). Estimation of linear models with incomplete data. In C. Clogg (Ed.), Sociological Methodology 1987 (pp. 71–103). San Francisco: Jossey-Bass.

Allison, P. D. (2000). Multiple imputation for missing data: A cautionary tale. Sociological Methods and Research, 28, 301–309.

Amemiya, T. (1984). Tobit models: A survey. Journal of Econometrics, 24, 3–61.

Anderson, T. W. (1957). Maximum likelihood estimates for the multivariate normal distribution when some observations are missing. Journal of the American Statistical Association, 52, 200–203.

Arbuckle, J. L., & Wothke, W. (1999). AMOS 4.0 user's guide [Computer software manual]. Chicago: Smallwaters.

Bentler, P. M. (in press). EQS structural equations program manual [Computer software manual]. Encino, CA: Multivariate Software.

BMDP Statistical Software. (1992). BMDP statistical software manual. Los Angeles: University of California Press.

Brand, J. P. L. (1999). Development, implementation and evaluation of multiple imputation strategies for the statistical analysis of incomplete data sets. Unpublished doctoral dissertation, Erasmus University, Rotterdam, The Netherlands.

Brown, R. L. (1994). Efficacy of the indirect approach for estimating structural equation models with missing data: A comparison of five methods. Structural Equation Modeling, 1, 287–316.

Bryk, A. S., Raudenbush, S. W., & Congdon, R. T. (1996). Hierarchical linear and nonlinear modeling with the HLM/2L and HLM/3L programs. Chicago: Scientific Software International.

Clogg, C. C., & Goodman, L. A. (1984). Latent structure analysis of a set of multidimensional contingency tables. Journal of the American Statistical Association, 79, 762–771.

Collins, L. M., Flaherty, B. P., Hyatt, S. L., & Schafer, J. L. (1999). WinLTA user's guide (Version 2.0) [Computer software manual]. University Park: The Pennsylvania State University, The Methodology Center.

Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing-data procedures. Psychological Methods, 6, 330–351.

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman & Hall.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1–38.

Diggle, P. J., & Kenward, M. G. (1994). Informative dropout in longitudinal data analysis (with discussion). Applied Statistics, 43, 49–73.

Duncan, S. C., & Duncan, T. E. (1994). Modeling incomplete longitudinal substance use data using latent variable growth curve methodology. Multivariate Behavioral Research, 29, 313–338.

Enders, C. K. (2001). The impact of nonnormality on full information maximum-likelihood estimation for structural equation models with missing data. Psychological Methods, 6, 352–370.

Gelman, A., Rubin, D. B., Carlin, J., & Stern, H. (1995). Bayesian data analysis. London: Chapman & Hall.

Glynn, R. J., Laird, N. M., & Rubin, D. B. (1993). Multiple imputation in mixture models for nonignorable nonresponse with followups. Journal of the American Statistical Association, 88, 984–993.

Graham, J. W., Cumsille, P. E., & Elek-Fisk, E. (in press). Methods for handling missing data. In J. A. Schinka & W. F. Velicer (Eds.), Comprehensive handbook of psychology: Vol. 2. Research methods in psychology. New York: Wiley.

Graham, J. W., & Donaldson, S. I. (1993). Evaluating interventions with differential attrition: The importance of nonresponse mechanisms and use of followup data. Journal of Applied Psychology, 78, 119–128.

Graham, J. W., & Hofer, S. M. (1991). EMCOV.EXE users' guide [Computer software manual]. Unpublished manuscript, University of Southern California, Los Angeles.

Graham, J. W., & Hofer, S. M. (2000). Multiple imputation in multivariate research. In T. D. Little, K. U. Schnabel, & J. Baumert (Eds.), Modeling longitudinal and multiple-group data: Practical issues, applied approaches, and specific examples (pp. 201–218). Hillsdale, NJ: Erlbaum.

Graham, J. W., Hofer, S. M., Donaldson, S. I., MacKinnon, D. P., & Schafer, J. L. (1997). Analysis with missing data in prevention research. In K. Bryant, M. Windle, & S. West (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 325–366). Washington, DC: American Psychological Association.

Graham, J. W., Hofer, S. M., & MacKinnon, D. P. (1996). Maximizing the usefulness of data obtained with planned missing value patterns: An application of maximum likelihood procedures. Multivariate Behavioral Research, 31, 197–218.

Graham, J. W., Hofer, S. M., & Piccinin, A. M. (1994). Analysis with missing data in drug prevention research. In L. M. Collins & L. Seitz (Eds.), Advances in data analysis for prevention intervention research (pp. 13–63). Washington, DC: National Institute on Drug Abuse.

Graham, J. W., & Schafer, J. L. (1999). On the performance of multiple imputation for multivariate data with small sample size. In R. Hoyle (Ed.), Statistical strategies for small sample research (pp. 1–29). Thousand Oaks, CA: Sage.

Graham, J. W., Taylor, B. J., & Cumsille, P. E. (2001). Planned missing data designs in the analysis of change. In L. M. Collins & A. Sayer (Eds.), New methods for the analysis of change (pp. 335–353). Washington, DC: American Psychological Association.

Heckman, J. (1976). The common structure of statistical models of truncation, sample selection and limited dependent variables, and a simple estimator for such models. Annals of Economic and Social Measurement, 5, 475–492.

Hedeker, D., & Gibbons, R. D. (1997). Application of random-effects pattern-mixture models for missing data in longitudinal studies. Psychological Methods, 2, 64–78.

Heitjan, D. F., & Rubin, D. B. (1991). Ignorability and coarse data. Annals of Statistics, 19, 2244–2253.

Hogan, J. W., & Laird, N. M. (1997). Model-based approaches to analyzing incomplete longitudinal and failure time data. Statistics in Medicine, 16, 259–272.

Hogg, R. V., & Tanis, E. A. (1997). Probability and statistical inference (5th ed.). Upper Saddle River, NJ: Prentice Hall.

Horton, N. J., & Laird, N. M. (1999). Maximum likelihood analysis of generalized linear models with missing covariates. Statistical Methods in Medical Research, 8, 37–50.

Horton, N. J., & Lipsitz, S. R. (2001). Multiple imputation in practice: Comparison of software packages for regression models with missing variables. American Statistician, 55, 244–254.

Ibrahim, J. G. (1990). Incomplete data in generalized linear models. Journal of the American Statistical Association, 85, 765–769.

Insightful. (2001). S-PLUS (Version 6) [Computer software]. Seattle, WA: Insightful.

Jennrich, R. I., & Schluchter, M. D. (1986). Unbalanced repeated-measures models with structured covariance matrices. Biometrics, 42, 805–820.

Joreskog, K. G., & Sorbom, D. (2001). LISREL (Version 8.5) [Computer software]. Chicago: Scientific Software International.

Kennickell, A. B. (1991). Imputation of the 1989 Survey of Consumer Finances: Stochastic relaxation and multiple imputation. Proceedings of the Survey Research Methods Section of the American Statistical Association, 1–10.

Kenward, M. G. (1998). Selection models for repeated measurements for nonrandom dropout: An illustration of sensitivity. Statistics in Medicine, 17, 2723–2732.

Kenward, M. G., & Molenberghs, G. (1998). Likelihood based frequentist inference when data are missing at random. Statistical Science, 13, 236–247.

King, G., Honaker, J., Joseph, A., & Scheve, K. (2001). Analyzing incomplete political science data: An alternative algorithm for multiple imputation. American Political Science Review, 95, 49–69.

Laird, N. M. (1994). Discussion of "Informative dropout in longitudinal data analysis" by P. J. Diggle and M. G. Kenward. Applied Statistics, 43, 84.

Laird, N. M., & Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963–974.

Lavori, P. W., Dawson, R., & Shera, D. (1995). A multiple imputation strategy for clinical trials with truncation of patient data. Statistics in Medicine, 14, 1913–1925.

Lindstrom, M. J., & Bates, D. M. (1988). Newton–Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association, 83, 1014–1022.

Lipsitz, S. R., & Ibrahim, J. G. (1996). Using the EM algorithm for survival data with incomplete categorical covariates. Lifetime Data Analysis, 2, 5–14.

Lipsitz, S. R., & Ibrahim, J. G. (1998). Estimating equations with incomplete categorical covariates in the Cox model. Biometrics, 54, 1002–1013.

Littell, R. C., Milliken, G. A., Stroup, W. W., & Wolfinger, R. D. (1996). SAS system for mixed models. Cary, NC: SAS Institute.

Little, R. J. A. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association, 88, 125–134.

Little, R. J. A. (1994). A class of pattern-mixture models for normal missing data. Biometrika, 81, 471–483.

Little, R. J. A. (1995). Modeling the dropout mechanism in repeated-measures studies. Journal of the American Statistical Association, 90, 1112–1121.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

Liu, M., Taylor, J. M. G., & Belin, T. R. (2000). Multiple imputation and posterior simulation for multivariate missing data in longitudinal studies. Biometrics, 56, 1157–1163.

Madow, W. G., Nisselson, J., & Olkin, I. (Eds.). (1983). Incomplete data in sample surveys: Vol. 1. Report and case studies. New York: Academic Press.

McArdle, J. J., & Hamagami, F. (1991). Modeling incomplete longitudinal and cross-sectional data using latent growth structural models. In L. M. Collins & J. L. Horn (Eds.), Best methods for the analysis of change (pp. 276–304). Washington, DC: American Psychological Association.

McCullagh, P., & Nelder, J. A. (1989). Generalized linear models. London: Chapman & Hall.

McLachlan, G. J., & Krishnan, T. (1996). The EM algorithm and extensions. New York: Wiley.

McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: Wiley.

Meng, X. L. (1994). Multiple-imputation inferences with uncongenial sources of input (with discussion). Statistical Science, 9, 538–573.

Meng, X. L. (1999, October). A congenial overview and investigation of imputation inferences under uncongeniality. Paper presented at International Conference on Survey Nonresponse, Portland, OR.

Multilevel Models Project. (1996). Multilevel modeling applications: A guide for users of MLn [Computer software manual]. London: University of London, Institute of Education.

Muthen, B., Kaplan, D., & Hollis, M. (1987). On structural equation modeling with data that are not missing completely at random. Psychometrika, 55, 107–122.

Muthen, L. K., & Muthen, B. O. (1998). Mplus user's guide [Computer software manual]. Los Angeles: Muthen & Muthen.

Neale, M. C., Boker, S. M., Xie, G., & Maes, H. H. (1999). Mx: Statistical modeling (5th ed.) [Computer software]. Richmond: Virginia Commonwealth University, Department of Psychiatry.

Nesselroade, J. R., & Baltes, P. B. (1979). Longitudinal research in the study of behavior and development. New York: Academic Press.

Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London, Series A, 236, 333–380.

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, 231, 289–337.

Raghunathan, T. E., Solenberger, P. W., & Van Hoewyk, J. (2000). IVEware: Imputation and variance estimation software. Ann Arbor: University of Michigan, Institute for Social Research, Survey Research Center.

Robins, J. M., & Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90, 122–129.

Robins, J. M., Rotnitzky, A., & Scharfstein, D. O. (1998). Semiparametric regression for repeated outcomes with non-ignorable non-response. Journal of the American Statistical Association, 93, 1321–1339.

Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846–866.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.

Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.

Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91, 473–489.

Rubin, D. B., & Thayer, D. (1983). More on EM for ML factor analysis. Psychometrika, 48, 69–76.

Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in covariance structure analysis. In A. von Eye & C. C. Clogg (Eds.), Latent variables analysis: Applications for developmental research (pp. 399–419). Thousand Oaks, CA: Sage.

Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.

Schafer, J. L. (1999a). Multiple imputation: A primer. Statistical Methods in Medical Research, 8, 3–15.

Schafer, J. L. (1999b). NORM: Multiple imputation of incomplete multivariate data under a normal model [Computer software]. University Park: Pennsylvania State University, Department of Statistics.

Schafer, J. L. (2001). Multiple imputation with PAN. In A. G. Sayer & L. M. Collins (Eds.), New methods for the analysis of change (pp. 355–377). Washington, DC: American Psychological Association.

Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for multivariate missing-data problems: A data analyst's perspective. Multivariate Behavioral Research, 33, 545–571.

Schafer, J. L., & Schenker, N. (2000). Inference with imputed conditional means. Journal of the American Statistical Association, 95, 144–154.

Schafer, J. L., & Yucel, R. M. (in press). Computational strategies for multivariate linear mixed-effects models with missing values. Journal of Computational and Graphical Statistics.

Schimert, J., Schafer, J. L., Hesterberg, T., Fraley, C., & Clarkson, D. (2001). Analyzing missing values in S-PLUS. Seattle, WA: Insightful.

Schluchter, M. D., & Jackson, K. L. (1989). Log-linear analysis of censored survival data with partially observed covariates. Journal of the American Statistical Association, 84, 42–52.

Sinharay, S., Stern, H. S., & Russell, D. (2001). The use of multiple imputation for the analysis of missing data. Psychological Methods, 6, 317–329.

Song, J. (1999). Analysis of incomplete high-dimensional normal data using a common factor model. Unpublished doctoral dissertation, University of California, Los Angeles.

Stata. (2001). Stata user's guide [Computer software manual]. College Station, TX: Author.

Statistical Solutions. (1998). SOLAS for missing data analysis (Version 1). Cork, Ireland: Author.

Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). Statistical analysis of finite mixture distributions. New York: Wiley.

Van Buuren, S., & Oudshoorn, C. G. M. (1999). Flexible multivariate imputation by MICE. Leiden, The Netherlands: TNO Prevention Center.

Verbeke, G., & Molenberghs, G. (2000). Linear mixed models for longitudinal data. New York: Springer-Verlag.

Vermunt, J. K. (1997). LEM: A general program for the analysis of categorical data [Computer software]. The Netherlands: Tilburg University, Department of Methodology.

Wothke, W. (2000). Longitudinal and multi-group modeling with missing data. In T. D. Little, K. U. Schnabel, & J. Baumert (Eds.), Modeling longitudinal and multiple-group data: Practical issues, applied approaches, and specific examples (pp. 219–240). Hillsdale, NJ: Erlbaum.

Yuan, K. H., & Bentler, P. M. (2000). Three likelihood-based methods for mean and covariance structure analysis with nonnormal missing data. Sociological Methodology, 30, 165–200.

Yuan, Y. C. (2000). Multiple imputation for missing data: Concepts and new development. In Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference (Paper No. 267). Cary, NC: SAS Institute.

Zeger, S. L., Liang, K. Y., & Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics, 44, 1049–1060.

Received July 23, 2001
Revision received January 15, 2002
Accepted January 16, 2002
