
DAVID R. JOHNSON The Pennsylvania State University

REBEKAH YOUNG The Pennsylvania State University*

Toward Best Practices in Analyzing Datasets with Missing Data: Comparisons and Recommendations

Although several methods have been developed to allow for the analysis of data in the presence of missing values, no clear guide exists to help family researchers in choosing among the many options and procedures available. We delineate these options and examine the sensitivity of the findings in a regression model estimated in three random samples from the National Survey of Families and Households (n = 250 – 2,000). These results, combined with findings from simulation studies, are used to guide answers to a set of 10 common questions asked by researchers when selecting a missing data approach. Modern missing data techniques were found to perform better than traditional ones, but differences between the types of modern approaches had minor effects on the estimates and substantive conclusions. Our findings suggest that the researcher has considerable flexibility in selecting among modern options for handling missing data.

Within the last decade, the practice of analyzing data in the presence of missing values has

Department of Sociology, The Pennsylvania State University, 211 Oswald Tower, University Park, PA 16802 ([email protected]).

*Departments of Sociology and Demography, The Pennsylvania State University, 211 Oswald Tower, University Park, PA 16802 ([email protected]).

Key Words: maximum likelihood, methods, missing data, multiple imputation, National Survey of Families and Households, regression.

been transformed by the availability of ‘‘modern’’ methods designed to include information from partial cases in the analysis. Adoption of these techniques has been enhanced by the availability of accessible sources (Acock, 1997; Allison, 2001; Schafer & Graham, 2002), which describe modern methods for handling missing data and the implementation of these methods in standard statistical software packages. Because most datasets analyzed by family researchers contain variables with missing values, the need to solve problems posed by incomplete data is widespread. Most researchers today are savvy to missing data issues and recognize the potential for traditional methods to bias estimates and reduce statistical power. Existing tools and guides smooth the road for the researcher, but a close look at the literature reveals a bewildering variety of alternative methods and options and provides no clear guidelines for choosing among them. Although some guides present a semblance of ‘‘best practices’’ for handling missing data (Acock, 2005; Enders, 2006; Graham, 2009; Howell, 2008), many questions remain unanswered regarding the best strategies to follow and the consequences of selecting one approach over another.

Our intention in this paper is to address several strategies for handling missing data and to provide guidance to the family researcher who faces the need to choose among them. We present an empirical example with real data to show the respective consequences of these alternative approaches for the research findings. We

926 Journal of Marriage and Family 73 (October 2011): 926 – 945. DOI: 10.1111/j.1741-3737.2011.00861.x


organize our paper around a set of 10 questions that researchers commonly raise when attempting to determine an appropriate strategy for handling missing data. Our goal is to help family researchers who are not quantitative methodologists to make more informed decisions when accounting for missing data in their analyses.

MODERN MISSING DATA APPROACHES

Today’s common, state-of-the-art methods for handling missing data have evolved in two directions. The first involves imputing missing values to allow for the use of methods that require a complete data matrix. The second involves estimating the joint distributions of all variables to make use of both complete and incomplete cases with a maximum likelihood (ML) procedure.

Imputation Methods

Compared with earlier methods of filling in missing values, such as mean substitution and regression imputation, modern imputation methods are designed to account for the missing data mechanism and adjust for the effects of incomplete data on statistical inference. If the mechanism responsible for the missing data is properly specified, modern methods yield correct standard errors and unbiased parameter estimates. The mechanism most commonly assumed is that the data are missing at random (MAR). Under the MAR assumption, the probability that a value is missing does not depend on the true value for that case after controlling for observed variables (Allison, 2001). Imputed values generated using modern methods take into account the uncertainty introduced by the presence of missing values. Each imputed missing value includes a random error component that is proportional to the extent to which the other variables in the imputation model cannot predict its true value. Hence, missing values imputed on variables that are strongly related to other variables within the model will have only small random error components, whereas variables with weaker relationships will have more error added to their imputed values.

When analyzing data with a single imputation, it is important to remember that the standard errors and significance tests will be biased; the uncertainty associated with each imputed value is underestimated, and statistical equations cannot distinguish between the observed and

imputed values. Multiple imputation (MI) offers a solution to this problem. In this approach, multiple datasets are created that vary from one another only in the imputed values they contain. The differences in the missing values assigned in each dataset reflect the degree of uncertainty involved in imputing the values. Statistical analysis is conducted separately on each dataset, and the results are then combined to produce a single set of pooled estimates. Finally, ‘‘Rubin’s rules’’ (Rubin, 1987) and related procedures are used to adjust the standard errors for this uncertainty.
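The pooling step just described is simple arithmetic. The sketch below is illustrative Python written for this summary (not code from the article or any of the packages it names): given one coefficient’s estimates and standard errors from m imputed-data analyses, Rubin’s rules take the pooled estimate as the mean of the m estimates and the total variance as the within-imputation variance plus the between-imputation variance inflated by (1 + 1/m).

```python
from statistics import mean, variance

def rubins_rules(estimates, std_errors):
    """Pool one coefficient across m imputed-data analyses (Rubin, 1987).

    estimates  -- the m point estimates of the coefficient
    std_errors -- the m standard errors of that coefficient
    """
    m = len(estimates)
    pooled = mean(estimates)                     # pooled point estimate
    within = mean(se ** 2 for se in std_errors)  # within-imputation variance
    between = variance(estimates)                # between-imputation variance
    total = within + (1 + 1 / m) * between       # Rubin's total variance
    return pooled, total ** 0.5                  # estimate and pooled SE
```

Because the between-imputation variance is added in, the pooled standard error is always at least as large as the average per-dataset standard error, which is exactly the correction for imputation uncertainty that a single imputation misses.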

Many algorithms have been proposed to impute missing values, but two approaches have been widely adopted and are available in the statistical packages commonly used by family researchers. The first approach is based on Markov chain Monte Carlo (MCMC) methods and the second on chained equations. The MCMC approach uses a ‘‘normal’’ statistical model that assumes that the missing values follow a MAR pattern and that all the variables in the model are continuous with a multivariate normal distribution (Rubin, 1987; Schafer, 1997). Categorical variables can be included as sets of dummy variables, and ordinal variables are treated as continuous. The ‘‘normal’’ assumption has been found to be robust even when many of the variables are not continuous or do not have a multivariate normal distribution (Lee, 2010; Schafer & Graham, 2002). The first widely used implementation of this approach was in the public domain NORM software program (http://www.stat.psu.edu/∼jls/misoftwa.html). It has also been implemented in the SAS MI and Stata MI procedures. The chained equations approach (also referred to as fully conditional specification, or FCS) imputes missing values by iteratively fitting a set of regression equations in which each variable is successively treated as the outcome variable and regressed on all other variables in the model. The set of regression equations is used to predict values, random error components are added to the predictions, and the results are substituted for the values that were missing. Each successive iteration uses the imputed values from the previous iteration in its equations. In this approach, the chained regression models can be tailored to correspond to the level of measurement of each variable. For example, binary variables are estimated using logistic regression, categorical variables with three or more categories by multinomial

Page 3: Toward Best Practices in Analyzing Datasets with Missing Data: Comparisons and Recommendations

928 Journal of Marriage and Family

regression, and ordered categorical variables by ordinal regression.

Alternatively, all variables can be treated as continuous, in which case the imputed estimates would approximate those obtained with the ‘‘normal’’ model. The most widely used implementations of the approach are the ICE procedure in Stata; IMPUTE in the IVEware statistical package, available for free download from the University of Michigan Center for Survey Research website (http://www.isr.umich.edu/src/smp/ive/); and the MI module available as an extra-cost option in recent versions of SPSS (PASW). Although each of these procedures uses a chained equation approach, the algorithms used and the options available differ in ways that may affect the estimates obtained.
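The chained-equations cycle is easier to see in miniature. The sketch below is our own illustrative Python, not the ICE, IVEware, or SPSS algorithm: with just two continuous variables, each pass regresses one variable on the other, then replaces each missing entry with the prediction plus a random error draw scaled to the residual standard deviation, which is the error component described above.

```python
import random

def ols(xs, ys):
    """Simple least squares of ys on xs: returns (intercept, slope, residual sd)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((u - mx) ** 2 for u in xs)
    sxy = sum((u - mx) * (v - my) for u, v in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sd = (sum((v - intercept - slope * u) ** 2
              for u, v in zip(xs, ys)) / max(n - 2, 1)) ** 0.5
    return intercept, slope, sd

def chained_impute(x, y, iterations=20, seed=1):
    """One chain of FCS imputation for two variables (missing entries are None)."""
    rng = random.Random(seed)
    x, y = list(x), list(y)                      # work on copies
    miss_x = [i for i, v in enumerate(x) if v is None]
    miss_y = [i for i, v in enumerate(y) if v is None]
    # initialize missing entries at the observed mean so every regression can run
    for v, miss in ((x, miss_x), (y, miss_y)):
        obs = [u for u in v if u is not None]
        for i in miss:
            v[i] = sum(obs) / len(obs)
    for _ in range(iterations):
        # regress y on x; refill missing y with prediction + random error
        a, b, sd = ols(x, y)
        for i in miss_y:
            y[i] = a + b * x[i] + rng.gauss(0, sd)
        # then regress x on y and refill missing x the same way
        a, b, sd = ols(y, x)
        for i in miss_x:
            x[i] = a + b * y[i] + rng.gauss(0, sd)
    return x, y
```

Running the chain m times with different seeds would yield the m completed datasets that MI then analyzes and pools.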

Combining Multiple Imputed Datasets

Conducting analyses in multiple datasets and pooling the results to produce single estimates of the coefficients and standard errors can be a barrier to using MI approaches. Although manually combining the results from each dataset is an option, pooling the estimates can be automated by procedures available in SPSS, SAS, and Stata. For example, when the researcher executes an ordinary least squares (OLS) regression model, MIANALYZE in SAS and MICOMBINE in Stata can be used to produce a single set of coefficients and corrected standard errors and significance tests. Of course, many of the models used by family researchers extend beyond the OLS approach (e.g., hierarchical linear modeling, fixed effects time series regression), and automated pooling procedures are not currently available for all techniques. The range of analysis models using multiply imputed datasets that can be automatically combined has been expanded greatly by the MIM prefix in the Stata ICE program. Stata’s MIM works for more than 40 different statistical procedures, including many panel (xt) and survey (svy) applications (Carlin, Galati, & Royston, 2008). MIANALYZE in SAS and the SPSS Multiple Imputation module also generate pooled estimates for many statistical procedures.

Maximum Likelihood Approaches

The second set of approaches for conducting multivariate analysis with incomplete data does not impute the missing values. Instead,

the parameter estimates and standard errors of the multivariate analyses are solved using a maximum likelihood (ML) estimation method that permits an incomplete data matrix. In ML, parameter estimates are selected that maximize the probability of observing the data that were, in fact, collected. The probability or probability density is called the likelihood function. Missing data are removed from the likelihood function mathematically and effectively treated as unknown random variables to be averaged over (under the assumptions of MAR and multivariate normality). Likelihood functions typically have a complicated form and are solved by special techniques such as expectation maximization (EM) algorithms.
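A deliberately tiny illustration of how ML ‘‘averages over’’ missing values rather than filling them in: the sketch below is our own illustrative Python (not an algorithm from the article or any SEM package) and runs EM for a bivariate normal in which x is fully observed and some y values are missing. The E-step never writes imputed data; it replaces each missing y with its conditional mean and second moment given x, and the M-step re-estimates the parameters from those expected sufficient statistics.

```python
def em_bivariate(x, y, iterations=25):
    """EM estimates of (mu_x, mu_y, var_x, var_y, cov) when some y are None (MAR)."""
    n = len(x)
    mu_x = sum(x) / n
    var_x = sum((u - mu_x) ** 2 for u in x) / n
    # initialize from the complete pairs via the regression of y on x
    obs = [(u, v) for u, v in zip(x, y) if v is not None]
    k = len(obs)
    mxo = sum(u for u, _ in obs) / k
    myo = sum(v for _, v in obs) / k
    vxo = sum((u - mxo) ** 2 for u, _ in obs) / k
    cvo = sum((u - mxo) * (v - myo) for u, v in obs) / k
    beta = cvo / vxo
    resid = sum((v - myo - beta * (u - mxo)) ** 2 for u, v in obs) / k
    mu_y = myo + beta * (mu_x - mxo)
    cov = beta * var_x
    var_y = resid + beta ** 2 * var_x
    for _ in range(iterations):
        beta = cov / var_x                  # slope of y on x
        resid = var_y - beta * cov          # conditional variance of y given x
        s_y = s_yy = s_xy = 0.0
        for u, v in zip(x, y):
            if v is None:                   # E-step: conditional moments given x
                ev = mu_y + beta * (u - mu_x)
                s_y, s_yy, s_xy = s_y + ev, s_yy + ev ** 2 + resid, s_xy + u * ev
            else:
                s_y, s_yy, s_xy = s_y + v, s_yy + v ** 2, s_xy + u * v
        mu_y = s_y / n                      # M-step: update the y parameters
        var_y = s_yy / n - mu_y ** 2
        cov = s_xy / n - mu_x * mu_y
    return mu_x, mu_y, var_x, var_y, cov
```

FIML applies the same observed-data-likelihood idea directly to the parameters of the analysis model, so no completed data matrix ever exists.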

A common ML approach, which has evolved in structural equation model (SEM) software, directly maximizes the likelihood for a specified model with incomplete data. This approach is often referred to as full information maximum likelihood (FIML). One impediment to the greater use of the FIML approach is that, because it has primarily been implemented in SEM packages (e.g., Mplus, Amos, LISREL, EQS), it has largely been viewed as a technique for use only in structural equation models. The method is less restrictive than researchers typically assume; many types of regression, including binary outcome models, multilevel analysis, and other limited dependent variable models, can be estimated using FIML.

COMPARING THE APPROACHES

There appears to be an emerging consensus in the recent literature that MI and FIML methods are superior to other approaches when analyzing datasets with missing values (Acock, 2005; Howell, 2008; Schafer & Graham, 2002). When choosing between these missing data strategies, a researcher must make many decisions according to the specific research situation. For example, researchers must select software, which variables to include in the model, the number of imputed datasets, whether to tailor the model to the measurement level of the variables (or to use the fully normal model), and whether to impute the dependent variable. A careful reading of the increasingly voluminous literature on modern missing data methods yields many recommendations and guidelines, but family researchers may be uncertain how these apply to the situations they typically confront. A practical


concern is that the software packages with which family researchers are most familiar restrict the choices of missing data methods. Using a less familiar package to handle missing data may be, understandably, impractical. Software considerations aside, the fundamental issue of concern to the researcher is the impact of the choice of methods on the substantive findings. If choosing one strategy over another renders no substantively meaningful difference in the coefficients or their statistical significance, then we would expect the researcher to be comfortable with either method, even though the missing data literature might deem one option preferable.

Ideally, researchers could assess the sensitivity of their substantive findings to the missing data approaches by repeating their analyses with different methods. Finding that substantive results were unaffected by the choice of methods would increase confidence that the missing data approach did not bias the results. Although conducting a sensitivity analysis in each study is impractical, one valuable contribution to the literature would be the application of several missing data approaches to a research problem in a dataset commonly used by family researchers. One of our goals is to conduct this sensitivity analysis.

In attempting to develop a set of guidelines for handling missing data in family research, we will also focus on a set of issues and questions likely to be raised as the researcher makes decisions about a missing data strategy. For each issue, we will explore what the literature suggests about the best choices and then show the consequences of these choices for an empirical research problem. We chose to use data from the National Survey of Families and Households (NSFH) for two reasons. First, they have been widely used in the family literature. Second, the degree of missing data found in the NSFH is typical of the pattern and extent of missing observations found in many studies analyzing survey data (e.g., the National Survey of Family Growth, the Marital Instability Over the Life Course study, Add Health, and the Fragile Families and Child Wellbeing Study).

METHOD

The NSFH Dataset

To represent different data circumstances encountered by researchers and to test whether

the choice of missing data strategies might be influenced by the size of the sample, we created three completely random subsamples from the original NSFH (first wave) dataset. Each dataset contains the observed level of missing data in the entire sample, which increased slightly in the data-cleaning process, and differs only in its sample size: 2,000, 500, and 250 cases, respectively. We tested all missing data approaches on each of the three subsamples.

The substantive problem we have selected as an example in this paper is a regression model of predictors of marital happiness. The outcome variable, marital happiness, was asked of all married respondents with the question: ‘‘Taking things all together, how would you describe your marriage?’’ Responses were on a scale of 1 (very unhappy) to 7 (very happy). Independent variables were selected that predicted marital happiness, varied in the proportion of their values that were missing, and varied by level of measurement. Table 1 presents descriptive information on these variables.

The percent missing was highest for coital frequency (23%) and for total household income (19% to 27%), which also had the largest range of missingness across the three subsamples. The Center for Epidemiologic Studies Depression scale (CES-D) is a summated scale of the 10 items included in the NSFH. Because we wanted to maximize the observed missingness, we coded the summed scale as missing if any item was missing. Alternative methods are available for handling missing values in scale items, several of which lead to a lower percent missing (Schafer & Graham, 2002). Two items, gender and number of children, contained no missing values.

Analysis Strategy

We used the regression model on the NSFH data to evaluate the sensitivity of the model estimates to the choice of missing data strategy. We estimated the model using several approaches in the three NSFH subsamples. Marital happiness was the dependent variable, and the 13 variables listed in Table 1 were predictors in the regression model, with one interaction term (female × chores) and a polynomial (years married squared) added to the equation. The data were analyzed without using the weights available in the NSFH, although most of the missing data approaches described here can be used with weighted data.


Table 1. Descriptive Information for Analysis Model Variables

M, SD, and Range are from the n = 2,000 subsample; the last three columns give the percent missing in each of the three data subsamples.

Variable^a                           M           SD          Range              % Missing
                                                                                n=2,000   n=500   n=250
Marital happiness                    6.0         1.4         1 – 7              5.0       5.2     4.0
Male = 0, (female) = 1               0.5         0.5         0 – 1              0.0       0.0     0.0
(Years married)                      18.0        15.5        0 – 68             4.7       3.0     5.2
Number of (children) ages 5 – 18     0.7         1.0         0 – 3              0.0       0.0     0.0
Household (income)                   $42,587.0   $50,325.5   $351 – $989,451    19.9      22.2    26.8
Years of (education)                 12.7        3.2         0 – 20             0.3       0.0     0.4
Does the (wife work)                 0.7         0.5         0 – 1              0.9       0.6     0.8
Fairness of household (chores)       2.9         0.7         0 – 5              3.6       4.2     2.8
(Religious) fundamentalism           2.8         1.2         1 – 5              10.0      8.4     8.0
CES-D (depression) scale             12.9        15.9        0 – 84             5.5       6.6     6.0
(Self-concept)                       2.0         0.8         1 – 5              6.0       5.8     4.4
Self-rated (health)                  4.1         0.9         1 – 5              4.9       4.6     4.0
Attitude about (cheating)            3.9         1.1         1 – 5              6.5       6.0     3.6
(Coitus) frequency                   2.1         1.3         0 – 5              22.8      23.2    22.8
Complete case sample size                                                       1,065     273     122

Note: Descriptive statistics are unweighted. Means, standard deviations, minimums, and maximums were similar across all three subsamples; therefore, descriptive statistics are shown only for the n = 2,000 subsample. ^a Words in parentheses indicate variable names in subsequent tables.

We have organized the presentation of the results around a series of questions researchers commonly consider when deciding on an appropriate strategy for handling missing data. For each issue, we describe the current state of knowledge and then present the findings from our comparisons, followed by recommendations based on previous literature and our findings.

RESULTS

Question 1: Are the ‘‘modern’’ imputation and FIML methods less biased than ‘‘traditional’’ methods for the types of analytical models routinely used in quantitative family research?

Many studies have shown that ‘‘traditional’’ methods, such as complete case analysis (i.e., listwise or casewise deletion), pairwise deletion, mean substitution, the use of indicators for missing values (i.e., dummy variable adjustment), and related methods should be avoided because they yield biased estimates or incorrect standard errors (Acock, 2005; Allison, 2001; Howell, 2008; Raghunathan, 2004; Schafer & Graham, 2002). When the amount of missing data is small or is distributed primarily at random, some traditional methods might yield sound estimates

without the added complexity introduced by the application of modern approaches. Complete case analysis, for example, has been shown to yield unbiased coefficients when the missing values are missing completely at random (MCAR) (Allison). The procedure may still have undesirable consequences because the standard errors may be unnecessarily large as a result of the reduced sample size (Allison). Some researchers argue that even if the missing data are not MCAR, when the amount of missingness is small the choice of modern or traditional methods is not likely to make much difference for the substantive conclusions. How might the researcher ascertain whether the amount of missingness is small enough to make no effective difference? Published simulations demonstrating bias often compare methods when the proportion missing in each variable is as high as 50% (e.g., Allison). But do these simulations provide a reasonable approximation of the amount of difference in estimates obtained from incomplete cases in family datasets? We attempted to address this question by comparing regression estimates with missing data handled under traditional and modern approaches in the larger NSFH subsample.
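The MCAR claim above is easy to verify in a seeded mini-simulation of our own (not from the article): when 40% of values are deleted completely at random, the complete-case mean stays on target; when high values are deleted more often (a non-MCAR mechanism), the complete-case mean is pulled down.

```python
import random

def complete_case_mean(values):
    """Mean after listwise deletion (None marks a missing value)."""
    obs = [v for v in values if v is not None]
    return sum(obs) / len(obs)

rng = random.Random(42)
data = [rng.gauss(5.0, 1.0) for _ in range(20000)]
full_mean = sum(data) / len(data)

# MCAR: each value deleted with probability 0.4, independent of everything
mcar = [None if rng.random() < 0.4 else v for v in data]

# Not MCAR: deletion probability rises with the value itself
mnar = [None if rng.random() < min(1.0, max(0.0, (v - 4.0) / 3.0)) else v
        for v in data]
```

Under MCAR the only cost of deletion is the reduced sample size (larger standard errors), which matches the point made in the text.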


Table 2. Different Methods and Software (Subsample, n = 2,000)

Cell entries are B (SE). The first three columns are ‘‘traditional’’ methods; the last two are ‘‘modern’’ methods.

Variable            Complete Case      Mean Substitution  Income Indicator   FIML (Mplus)^a     Multiple Imputation
                    (n = 1,065)        (n = 2,000)        (n = 1,261)                           (Stata ICE)^b (n = 2,000)
Female^c            0.065 (0.086)      0.032 (0.062)      0.048 (0.081)      0.032 (0.065)      0.029 (0.066)
Years married^c     0.006 (0.004)      0.001 (0.003)      0.002 (0.003)      0.002 (0.003)      0.002 (0.003)
Years married²^c,d  0.320 (0.204)      0.348 (0.131)**    0.428 (0.182)*     0.386 (0.144)*     0.378 (0.148)*
Children            −0.118 (0.042)**   −0.075 (0.031)*    −0.123 (0.039)**   −0.089 (0.033)*    −0.089 (0.033)**
Income (log)^c      −0.129 (0.121)     −0.043 (0.089)     −0.108 (0.119)     −0.082 (0.101)     −0.075 (0.104)
Education^c         0.001 (0.016)      −0.021 (0.010)*    −0.013 (0.014)     −0.023 (0.011)*    −0.023 (0.011)*
Wife work           −0.152 (0.098)     −0.159 (0.068)*    −0.170 (0.090)     −0.174 (0.073)*    −0.176 (0.075)*
Chores^c            0.204 (0.059)**    0.207 (0.046)***   0.210 (0.056)***   0.209 (0.047)***   0.207 (0.048)***
Religious           0.125 (0.034)***   0.061 (0.026)*     0.103 (0.031)**    0.067 (0.027)*     0.068 (0.027)*
Depression          −0.010 (0.003)***  −0.009 (0.002)***  −0.013 (0.003)***  −0.010 (0.002)***  −0.010 (0.002)***
Self-concept        −0.229 (0.052)*    −0.231 (0.038)***  −0.255 (0.048)***  −0.224 (0.040)***  −0.224 (0.040)***
Health              0.112 (0.050)**    0.148 (0.036)***   0.121 (0.045)**    0.152 (0.038)***   0.153 (0.038)***
Cheating            0.126 (0.036)**    0.115 (0.026)***   0.111 (0.033)**    0.116 (0.027)***   0.114 (0.027)***
Coitus^c            0.190 (0.034)***   0.183 (0.026)***   0.199 (0.032)***   0.192 (0.028)***   0.190 (0.029)***
Female × chores     0.225 (0.120)      0.270 (0.094)*     0.249 (0.115)**    0.260 (0.097)**    0.256 (0.098)**
Constant            5.410 (0.309)***   5.437 (0.224)***   5.575 (0.289)***   5.417 (0.239)***   5.417 (0.249)***
R-squared           .151               .137               .176               .150               .148

^a Full information maximum likelihood estimation (with no auxiliary information). ^b Multiple imputation model with m = 25 datasets; imputation model informed by all analysis model variables. ^c Variable is centered on the mean. ^d Coefficient multiplied by 1,000.

*p < .05. **p < .01. ***p < .001.

Table 2 shows the results from regression analyses for the NSFH (n = 2,000) subsample, using five approaches to handling missing data. The model includes all the variables from Table 1, with marital happiness as the dependent variable. Two variables, total household income and coital frequency, were transformed to normalize their distributions: Income was logged and coital frequency was squared. The squared coefficient for years married was multiplied by 1,000 for readability, and several variables were centered at their means, including all variables involved in the interaction and polynomial terms.
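The variable preparation described above amounts to a few transformations before the model matrix is built. The snippet below is our own illustrative Python with made-up values; the variable names are placeholders, not the NSFH names. Centering is done before products are formed, so the interaction and squared terms are products of centered variables.

```python
import math

def center(values):
    """Center a variable at its own mean (done before forming products)."""
    m = sum(values) / len(values)
    return [v - m for v in values]

# transformations described in the text (illustrative raw values)
income = [30000.0, 45000.0, 52000.0, 80000.0]
coitus = [1.0, 2.0, 3.0, 2.0]
log_income = [math.log(v) for v in income]   # income is logged
coitus_sq = [v ** 2 for v in coitus]         # coital frequency is squared

# interaction and polynomial terms use centered variables
female_c = center([0.0, 1.0, 1.0, 0.0])
chores_c = center([2.0, 3.0, 4.0, 3.0])
female_x_chores = [f * c for f, c in zip(female_c, chores_c)]
years_c = center([10.0, 20.0, 5.0, 25.0])
years_sq = [y ** 2 for y in years_c]
```

Centering in this way keeps the lower-order coefficients interpretable at the sample means, which is why the table footnotes flag the centered variables.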

The first three sets of models in the table used traditional approaches: complete case analysis, mean substitution, and dummy variable adjustment. The model using complete case analysis excluded any case containing missing data, reducing the sample size by almost half. In the model using mean substitution, missing values were replaced with the variable mean. For the dummy variable adjustment model, respondents missing on income were assigned the mean, and

a dummy indicator was included in the regression equation (0 = present, 1 = missing; Cohen & Cohen, 1985). The regression model was then estimated by complete case analysis. The coefficient for the missing income indicator (not shown) was statistically significant (p < .05).

The first ‘‘modern method’’ regression model was estimated by full information maximum likelihood (FIML) in Mplus. The second ‘‘modern’’ model shown in the table was estimated with multiple imputation in Stata ICE with 25 datasets (m = 25), and all variables were treated as continuous. In both models, the missing data were informed only by the analysis variables, including the interaction and polynomial terms.

Around two thirds of the independent variables included in the regression models were statistically significant (p < .05), and the average explained variance in the models was about 15%. For this paper, the specific substantive findings are of little concern. Rather, our focus is on the consistency of the coefficients and standard errors across methods.


Close examination of the two modern approaches found them to be remarkably consistent in the values of the b coefficients, the magnitude of the standard errors, and the level of significance obtained. For researchers worried that MI is tantamount to ‘‘making up data,’’ the nearly identical results produced by MI and FIML, which does not impute values, should alleviate that concern. The consistency of the estimates found in these modern methods was not shared by the three models estimated using traditional approaches.

In the complete case analysis model, all variables had larger standard errors than found in the modern approaches, which largely reflects the smaller sample size. Three variables (the squared term for years married, the female × chores interaction, and the wife’s work status) were statistically significant (p < .05) when modern methods were used, but not in the complete case analysis model. Overall, the loss of statistical power was the major limitation of the complete case analysis approach, as the magnitude and direction of the coefficients in the models were similar.

The mean substitution approach performed well, with more of the estimates statistically significant than in the complete case analysis model, but the standard errors were smaller than in any of the other approaches. It is known that mean substitution underestimates item variances, which inflates statistical power and results in biased significance tests. Estimates in the dummy variable adjustment model were similar to those in the complete case analysis model, with slightly smaller standard errors. Sufficient research has shown both mean substitution and dummy variable adjustment to be biased and outdated methods (Allison, 2001; Little & Rubin, 2002). Although we would not recommend these methods, this example shows that previously published research that handled missing data in these ways should not be discounted; the results of these approaches may be robust given the type of missingness typically found in family-related data.
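The variance understatement behind mean substitution’s too-small standard errors is visible in a toy example of our own: filling half the cases with the observed mean leaves the mean unchanged but sharply shrinks the estimated variance, and every standard error built from that variance shrinks with it.

```python
def sample_var(values):
    """Sample variance with denominator n - 1."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

observed = [2.0, 4.0, 6.0, 8.0]                 # four observed values, mean 5.0
mean_filled = observed + [5.0] * 4              # four "missing" cases set to the mean
```

Here the mean-filled series keeps the same mean (5.0) but its variance drops from 20/3 to 20/7, because the substituted values contribute deviations of exactly zero while still inflating the apparent sample size.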

Question 1 Recommendations

Although the substantive interpretations one could draw from the traditional approaches were, in many ways, similar to those from the modern ones, there were enough differences in the patterns of significant effects and the size and

direction of the coefficients to raise serious concerns about the validity of continued use of these approaches by family researchers. We concur with the conclusions of many other studies (e.g., Allison, 2001) that these methods should be avoided due to the high likelihood of obtaining biased estimates or loss of statistical power.

Question 2: In what situations should the researcher use FIML methods, and in what situations should MI methods be used?

The multiple imputation (MI) and FIML approaches are identified in the literature as the preferred modern methods of handling missing data (Acock, 2005; Enders, 2006; Graham, 2007, 2009). These approaches are closely related theoretically, and neither approach is inherently better than the other. In fact, MI and FIML will produce equivalent results when the input data and models are the same and the number of imputed datasets in MI is sufficiently large (Collins, Schafer, & Kam, 2001; Graham, 2003). Our comparison of these two approaches in Table 2 confirms that they produced nearly identical results. Practical considerations, such as software limitations, often drive the choice between the two, although there are several other points to consider.

FIML deals with missing data and parameter estimation in one step, eliminating the need to create imputed values. To date, the approach has been primarily implemented in structural equation software, and its use has been largely limited to structural equation models (SEM). Nevertheless, the range of statistical models that can be analyzed with FIML in SEM programs (Muthen & Muthen, 2010) is larger than generally realized. For example, Mplus can use FIML to estimate survival analysis models and models with binary dependent variables and can handle stratified, clustered, and weighted survey data. Although MI approaches to some common data structures, such as those used in multilevel models, are still in need of further development (Yucel, Schenker, & Raghunathan, 2007), MI can be used with a larger variety of statistical models than those supported by FIML.

Both FIML and MI approaches have software implementations that allow models accounting for missing data to be informed by a larger set of variables than those included in the analysis. In structural equation programs such as Mplus, this is accomplished by adding auxiliary variables that condition the estimates of the covariance matrix without entering them into the analysis model (Acock, 2005; Graham, 2003). Incorporating auxiliary information into an MI technique is simpler because the imputation model is a separate step from the analysis model and each step can have different sets of variables.

A final point of comparison is that MI is a data-based technique and FIML is a direct model-based technique. FIML is limited in this respect because estimates of the same parameters and their confidence intervals may vary from analysis to analysis (Myung, 2003). With data assigned by imputation, on the other hand, estimates can be calculated as if the dataset were complete. Results obtained from different analysis models using MI will be mutually consistent, which is not the case with FIML (Little & Rubin, 2002). When multiple researchers are to conduct analysis with the same dataset, MI may be a more attractive option simply because it allows analysis to be done on a collection of regular data matrices.

One aspect of MI that has generated concern in recent years is that the imputation step might use more correct information than the analysis model, which could create superefficiency (Rubin, 1996). This type of situation is generally referred to as ''uncongeniality'' (Meng, 2001; Robins & Wang, 2000). If uncongeniality occurs, it usually leads to conservative inferences and, under special circumstances, could yield invalid results. Although researchers should be aware of this potential problem, it is most relevant for survey statisticians who are imputing a large dataset for public release and irrelevant for researchers who are imputing with a theoretical model in mind (Kenward & Carpenter, 2007). The actual risk of uncongeniality for family researchers is probably quite low and is not a ''disadvantage'' compared to FIML. The practical conclusion is that MI can be done safely even when the ultimate user may be applying models or analysis not contemplated by the imputer (Little & Rubin, 2002).

The regression models presented in Table 2 show our application of both FIML and MI methods using standard software implementations. Either technique, when applied to these data, yielded similar coefficients and standard errors. This finding held for all three dataset conditions, though the estimates differed most for the n = 250 dataset. Under all conditions, if we had selected any one of these approaches to test hypotheses, the substantive conclusions drawn from our results would have been the same.

Question 2 Recommendations

The choice of using a FIML or an imputation strategy for handling missing data may be determined by practical considerations. Such considerations include access to the software and whether the specific planned analysis models can be carried out with the missing data approach selected. To date, the MI approach has the advantages of being more flexible and applicable to a wider variety of models. We anticipate that further developments in FIML algorithms will broaden the choice of compatible statistical models. The limitation of the imputation approach is the extra effort required to generate imputed datasets, although complete datasets can also be an advantage because analysis results will be consistent across models.

Question 3: Does the ''modern'' imputation-based software implementation used affect the results?

The three imputation approaches used most often by family researchers are the normal-MCMC procedures (as implemented in SAS MI and Stata MI) and the chained-equation procedure (as implemented in Stata ICE and SPSS MI). Recent simulation studies (Lee, 2010) found that the MCMC and chained-equation multiple imputation approaches yield similar results. Our comparisons using the NSFH datasets confirmed this conclusion.
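The logic of a chained-equation procedure can be sketched in a few lines. This is a deliberately minimal, single-chain toy with two hypothetical variables and a fixed residual spread; production implementations such as Stata ICE also draw the regression parameters from their posterior distributions, cycle over many variables with an equation tailored to each, and save m completed datasets rather than one.

```python
# Toy sketch of chained-equation (regression-switching) imputation.
import random
random.seed(1)

def slope_intercept(xs, ys):
    """Ordinary least squares with a single predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def impute_from(target, predictor, resid_sd):
    """Replace None entries in target with stochastic regression draws."""
    pairs = [(p, t) for p, t in zip(predictor, target) if t is not None]
    a, b = slope_intercept([p for p, _ in pairs], [t for _, t in pairs])
    return [t if t is not None else a + b * p + random.gauss(0, resid_sd)
            for p, t in zip(predictor, target)]

def mean_fill(v):
    """Crude starting values: fill missing cells with the observed mean."""
    obs = [u for u in v if u is not None]
    return [u if u is not None else sum(obs) / len(obs) for u in v]

# Two variables, each missing one value (hypothetical data).
x = [1.0, 2.0, 3.0, 4.0, None, 6.0]
y = [1.1, 1.9, None, 4.2, 5.1, 5.9]

x_cur, y_cur = mean_fill(x), mean_fill(y)
for _ in range(10):                        # cycles standing in for a burn-in
    x_cur = impute_from(x, y_cur, 0.3)     # refresh x's missing cell given y
    y_cur = impute_from(y, x_cur, 0.3)     # refresh y's missing cell given x
```

Each cycle re-imputes one variable conditional on the current filled-in values of the others, which is what lets the method tailor each equation to a variable's measurement level (e.g., a logistic equation for a binary item) without a joint multivariate normal model.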

In Table 3, we compare SPSS MI, SAS MI, and Stata ICE results to the FIML model. Because of the large number of regression models needed to compare different approaches, we present summary data on the differences between the models rather than include all the regression coefficients, standard errors, and other results from each model. A dataset including all coefficients from the regression models is available from the authors upon request. After exploring several approaches for summarizing data from many models, the one that appeared most informative and practical for our purposes was to select one model as the ''standard'' and compare the estimates for all other models with this standard. This approach is similar to that used in simulation studies except that we do not have the true estimates when no missing values are present.

Table 3. Comparison of Different Imputation Conditions to FIML Model

                                Difference in b-Coefficients    Difference in t-Statistics
                                Mean^a    SD     Range          Mean^b    SD      Min      Max

Software
  Stata ICE                     0.002   0.000    0.003          −0.063   0.023   −0.327    0.081
  SAS PROC MI                   0.002   0.000    0.005          −0.014   0.022   −0.143    0.169
  SPSS MI                       0.002   0.000    0.004          −0.027   0.021   −0.128    0.112
Single imputation in SPSS EM
  n = 2,000                     0.009   0.004    0.056           0.391   0.121   −0.080    1.902
  n = 500                       0.013   0.005    0.069           0.237   0.097   −0.123    1.374
  n = 250                       0.012   0.004    0.056           0.095   0.051   −0.227    0.661
Number of datasets
  n = 2,000
    m = 1                       0.012   0.003    0.039           0.316   0.167   −0.335    2.011
    m = 5                       0.006   0.001    0.016           0.001   0.064   −0.399    0.472
    m = 10                      0.004   0.001    0.008          −0.022   0.049   −0.460    0.329
    m = 25                      0.002   0.000    0.003          −0.063   0.023   −0.327    0.081
    m = 100                     0.001   0.000    0.005          −0.030   0.014   −0.166    0.057
    m = 250                     0.001   0.000    0.003          −0.017   0.009   −0.090    0.029
  n = 500
    m = 1                       0.025   0.005    0.068           0.036   0.104   −0.608    0.601
    m = 5                       0.007   0.001    0.015          −0.062   0.079   −1.060    0.266
    m = 10                      0.004   0.001    0.010          −0.040   0.046   −0.616    0.160
    m = 25                      0.004   0.001    0.015          −0.030   0.029   −0.364    0.104
    m = 100                     0.003   0.001    0.012          −0.023   0.019   −0.228    0.077
    m = 250                     0.003   0.001    0.010          −0.026   0.013   −0.141    0.078
  n = 250
    m = 1                       0.066   0.011    0.117           0.476   0.180   −0.446    1.591
    m = 5                       0.012   0.002    0.025          −0.090   0.061   −0.617    0.169
    m = 10                      0.007   0.002    0.033          −0.117   0.031   −0.365    0.075
    m = 25                      0.005   0.001    0.011          −0.071   0.019   −0.226    0.038
    m = 100                     0.003   0.001    0.008          −0.060   0.015   −0.171    0.014
    m = 250                     0.003   0.000    0.006          −0.048   0.014   −0.165    0.034

^a bdiff = Σ |b_i.MI StdX − b_i.FIML StdX| / ncoef
^b tdiff = Σ (|t_i.MI| − |t_i.FIML|) / ncoef

Table 3 presents summary indicators for a set of regression models that were created from coefficient-to-coefficient comparisons to a corresponding standard model. In this table, we used the FIML model estimated in the same dataset as the standard model. Although the FIML model results are not true estimates, they serve as a useful standard to assess the sensitivity of different multiple imputation options. We present the mean of the absolute value of the differences in the b-coefficients standardized on X by subtracting the FIML estimate from the model estimates. For example, the mean difference in the b-coefficients (bdiff) was calculated with the following formula:

bdiff = Σ |b_i.MI StdX − b_i.FIML StdX| / ncoef

where the b-coefficients were standardized by X using the following formula:

b_i StdX = b_iXi × σ_iXi

The b-coefficient for every independent variable in the regression model (denoted by b_iXi) was standardized on X by multiplying the b-coefficient by the standard deviation of that variable (σ_iXi) to produce a standardized b-coefficient (b_i StdX). Standardization was an important step so that the different metrics of the independent variables did not differentially influence the summary measure. The mean difference was calculated by summing the absolute value of the difference of each standardized b-coefficient from the FIML model (b_i.FIML StdX) subtracted from the MI estimates (b_i.MI StdX) and dividing by the number of b-coefficients in the regression model (ncoef). These mean differences in b-coefficients, the standard deviation of these differences, and the range of the differences allowed us to examine the sensitivity and potential biases in the magnitude of the estimated effects.

Table 3 also includes the difference between the MI and FIML t-statistics for each of the independent variables in the model, calculated with the following formula:

tdiff = Σ (|t_i.MI| − |t_i.FIML|) / ncoef

where t_i is the t-statistic for each independent variable in the model. We subtracted the absolute value of the FIML t-statistic from the absolute value of the MI t-statistic and averaged this difference across the 15 model coefficients (the constant was excluded). This difference allowed us to infer that, when the mean difference in the t-statistics was positive, the specific missing data model may have underestimated the uncertainty and produced unrealistically small standard errors. A negative difference in the t-statistics indicated lower statistical power, or larger standard errors, in the MI model than in the FIML model.
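The two summary measures are simple to compute directly; a minimal sketch with made-up coefficients (the function names and numbers are ours, not values from Table 3):

```python
# Summary measures comparing an MI model to a standard (FIML) model.

def bdiff(b_mi, b_fiml, sd_x):
    """Mean absolute difference of X-standardized b-coefficients:
    each coefficient is multiplied by its predictor's standard deviation
    before differencing, so variable metrics do not dominate."""
    diffs = [abs(b1 * s - b2 * s) for b1, b2, s in zip(b_mi, b_fiml, sd_x)]
    return sum(diffs) / len(diffs)

def tdiff(t_mi, t_fiml):
    """Mean difference of absolute t-statistics; a negative value means
    the MI model had smaller |t| (larger standard errors) than FIML."""
    return sum(abs(a) - abs(b) for a, b in zip(t_mi, t_fiml)) / len(t_mi)

b_mi, b_fiml = [0.50, -0.20, 1.10], [0.48, -0.22, 1.15]
sd_x = [1.0, 2.0, 0.5]                 # standard deviations of the predictors
t_mi, t_fiml = [2.1, -1.0, 3.4], [2.3, -0.9, 3.5]

print(bdiff(b_mi, b_fiml, sd_x), tdiff(t_mi, t_fiml))
```

With these inputs bdiff is about 0.028 and tdiff about −0.067, a pattern that would be read as a trivial coefficient difference and slightly larger MI standard errors.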

The summary measures comparing MI software programs in Table 3 show that Stata ICE, SAS PROC MI, and SPSS MI gave similar estimates. The small mean differences in the b-coefficients (e.g., the mean b-coefficient difference between SAS and the FIML model was .002), the standard deviation of the b-coefficient differences, and the limited range all suggest that only minor, probably random, differences exist between the programs. The mean differences in the t-statistics were all slightly negative, showing that the FIML model had slightly higher t-values than the MI models.

Question 3 Recommendations

Based on our review of the literature and our findings with the NSFH data, we conclude that the FIML approach or any of the multiple imputation software approaches tested here would yield similar substantive conclusions, at least in data analyses with the degree of missing data found in many of the large, national family surveys. The single imputation SPSS EM method did not perform as well, particularly with respect to significance testing. The new MI procedure in SPSS is a preferred alternative for researchers who are most familiar with SPSS for their analyses.

Question 4: How many imputations are needed to produce stable and unbiased estimates?

Compared with imputing only a single value, multiple imputation has been shown to have distinct advantages in the accuracy of standard errors and significance tests. With only a single imputed value, estimates cannot take into account the uncertainty introduced by the imputed data; therefore, the standard errors will be underestimated. Assigning more than one imputed value for each missing response allows imputed values to vary depending on the degree of uncertainty in their estimation and allows standard errors to be adjusted to reflect this uncertainty.
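The pooling step that makes this adjustment follows Rubin's rules, which are simple enough to state in code; the coefficient and squared-standard-error inputs below are hypothetical:

```python
# Rubin's rules: pool one coefficient across m imputed datasets.
import math

def rubin_pool(estimates, variances):
    """Return the pooled estimate and its standard error.

    estimates: the coefficient from each of the m completed-data analyses
    variances: the squared standard error from each analysis
    """
    m = len(estimates)
    qbar = sum(estimates) / m                                 # pooled estimate
    w = sum(variances) / m                                    # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)     # between-imputation variance
    total = w + (1 + 1 / m) * b                               # total variance
    return qbar, math.sqrt(total)

est, se = rubin_pool([0.52, 0.47, 0.55, 0.49, 0.51],
                     [0.010, 0.011, 0.009, 0.010, 0.012])
print(est, se)
```

The between-imputation term b is what a single imputation (m = 1) cannot supply; it is the component that widens the pooled standard error to reflect imputation uncertainty.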

In the past, most guides in the literature suggested that five imputed datasets were sufficient to account for this uncertainty. Recent evidence suggests that more than five may be needed. Graham, Olchowski, and Gilreath's (2007) simulations show that as the amount of missing information increases, additional imputations (sometimes up to 100) produce more stable estimates with greater statistical power. Whether more imputed datasets make a difference for the kind of survey data that family researchers normally analyze, and under what circumstances differences are most likely to emerge, remains uncertain. To help address this issue, we compare the results when the number of imputed datasets varies.

Table 3 includes comparisons in which the number of imputations varies for each of the NSFH subsamples. Regardless of sample size, the more datasets generated, the closer the b-coefficients and t-statistics came to matching the FIML estimates. The singly imputed models performed the poorest. We include comparisons with the single-imputation SPSS EM procedure found in the MVA module in SPSS. This procedure has seen substantial use by family researchers but has been criticized because it uses only a single imputation and the algorithm fails to adequately account for the uncertainty due to missing data. As a result, SPSS EM yields biased estimates and incorrect standard errors (von Hippel, 2004). The comparisons in Table 3 illustrate this problem. In this example, the coefficients from the single imputation SPSS EM program were not notably biased, but the standard errors were smaller than those produced with FIML (shown by the large mean difference in the t-statistics).

Similar to what we found for the SPSS EM procedure, when only a single imputed dataset was used (m = 1) in Stata ICE, the error variance of the imputed values was also underestimated, shown by the positive mean difference in the t-statistics (e.g., 0.316 for the n = 2,000 subsample). This effect was most pronounced in the largest and smallest samples. With a sample size of 250, a greater number of imputed datasets appeared to be most effective at improving the fit to the FIML model. A similar pattern was found for other imputation software we tested but did not present here.

There is some evidence supporting the results of the simulation studies by Graham and colleagues (2007) that a larger number of imputed datasets improves statistical inference and the stability of the estimates, with the largest effects noted among smaller samples and amid greater levels of missing data. Graham and colleagues provide a table in their article that shows the number of imputations recommended for different degrees of missing information in the data matrix (Table 1, p. 208). Missing information is different from the proportion of missing observations, because variables that have higher covariances with other variables in the model have less missing information than those with lower covariances. But calculating an estimate of missing information usually requires multiple imputations. Additionally, the estimate of missing information is subject to sampling error when the number of imputed datasets is small. For example, we used SAS PROC MIANALYZE, which provides estimates of missing information by variable, to estimate the missing information for the income variable. When these models were repeated a number of times with different random seeds, the estimates of the missing information ranged from .07 to .38 when 5 imputed datasets were used. We also estimated this value when 50 to 100 imputed datasets were used. The estimated missing information had converged to .31 by 50 datasets and remained at this level up to 100 datasets. Estimating missing information requires running many imputation models with varying numbers of datasets and appears to be of limited utility in helping researchers choose the number of imputed datasets to generate.
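The large-sample version of this estimate is simply the between-imputation share of the total variance from the pooling step; a brief sketch with hypothetical variance components (implementations such as SAS PROC MIANALYZE add a small-sample adjustment, so reported values differ slightly):

```python
# Estimated fraction of missing information (large-sample lambda).

def missing_information(within, between, m):
    """lambda = (1 + 1/m) * B / T, where T = W + (1 + 1/m) * B.

    within:  average within-imputation variance (W)
    between: between-imputation variance (B)
    m:       number of imputed datasets
    """
    total = within + (1 + 1 / m) * between
    return (1 + 1 / m) * between / total

low_m = missing_information(0.010, 0.004, m=5)     # noisier with few datasets
high_m = missing_information(0.010, 0.004, m=100)
print(low_m, high_m)
```

Because B itself is estimated from only m draws, the statistic bounces around when m is small, which is the instability (.07 to .38 with m = 5) described above.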

Question 4 Recommendations

Our results suggest that using more than 10 imputed datasets can improve the stability of estimates, but the researcher is unlikely to make errors in the substantive interpretation of the findings even if as few as 5 are used with a large sample size. Nonetheless, we do recommend using more datasets (25+), particularly with smaller sample sizes and larger amounts of missing data. With the increased speed and storage space of computers and the availability of procedures to pool the estimates, the researcher's workload is only marginally affected by generating 25 instead of 5 datasets when working with a relatively small number of variables (∼20). For example, SAS MI generated 100 imputed datasets for a model with 20 variables on a relatively fast Windows computer in less than 10 minutes. Stata ICE (with the persist option) and the SPSS MI module tend to run even faster when imputing a relatively small number of variables.

Question 5: What variables should be included to inform the procedure used to account for missing values?

The issues of how many and what type of variables should be used to inform missing data procedures have been widely discussed in the literature. The general consensus is that the missing data model should be at least as complete as the analysis model (Acock, 2005; Collins et al., 2001; Graham, 2003). When a variable in the analysis model is not used to inform the missing data estimates, the imputed values for that variable are uncorrelated with other variables in the model and the covariances are underestimated. This also applies to variables used to specify interaction effects and curvilinearity. This is not an issue with FIML because all variables in the analysis model are included in the estimation of the covariance matrix. With MI approaches, however, any interactions or polynomials in the analysis model should be added to the imputation model (von Hippel, 2009). Simulations (von Hippel, 2009) have found that calculating these terms from their component variables after imputation, instead of including them in the imputation model, can lead to biased results.
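In practice this means building the derived columns before imputation, so the MI routine treats them as variables in their own right rather than recomputing them afterward. A sketch with hypothetical variable names:

```python
# "Transform, then impute": derived terms are created before imputation and
# imputed directly, not recomputed from imputed components afterward.
rows = [
    {"duration": 10.0, "happiness": 6.0},
    {"duration": 25.0, "happiness": None},
    {"duration": None, "happiness": 4.0},
]

for r in rows:
    d, h = r["duration"], r["happiness"]
    # A derived term is missing whenever any component is missing; the MI
    # routine then fills these cells just like any other variable.
    r["duration_sq"] = None if d is None else d ** 2
    r["duration_x_happiness"] = None if (d is None or h is None) else d * h
```

The resulting columns (including their None cells) all enter the imputation model together, which preserves the covariances between the product/polynomial terms and the rest of the model.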

A similar problem can arise when the imputed data have been informed by only a subset of the variables in the analysis model. Some large datasets, particularly those from the U.S. Census Bureau (e.g., the Current Population Survey), are released with imputed data, usually using a hot-deck approach that relies on a small set of variables to select donors for the missing values. This can lead to attenuation of the covariances and biased results.

A final issue is the number and types of additional or auxiliary variables, beyond those included in the analysis model, that should be included to model the missing data. Intentionally adding variables not included in the analysis model to the missing data model has been recommended (Acock, 2005; Collins et al., 2001; Enders, 2010; Graham, 2003) because they can improve the estimates and increase the likelihood of meeting the MAR assumption. Simulation studies (Bauldry, 2010; Collins et al., 2001; Enders, 2010; Yoo, 2009) have found that including auxiliary variables in the imputation or FIML model can yield more accurate and stable estimates. This effect depends on the type of auxiliary variables used and how correlated they are with variables in the analysis model. Moreover, auxiliary variables containing missing values are nearly as effective in reducing bias as those with no missing values (Enders, 2010). Including auxiliary variables that are strongly correlated with variables in the analysis model (r ∼ .7 – .9) can reduce bias and standard errors, but when the correlations are more modest (r < .4), the effect on the estimates and standard errors seems negligible (Collins et al., 2001; Enders, 2010; Enders & Peugh, 2004). Regardless, including them does no harm.

We explore, using five conditions, the effects of different ways of informing the missing data model. The first condition, ''way under-informed,'' imputes a dataset in which all analysis-model variables are informed only by marital happiness, gender, income, marital duration, and number of children (this is similar to what might be found in data imputed by hot deck). The second condition, ''under-informed,'' uses all variables in the regression model except the interaction and polynomial terms. This tests the situation where a researcher computes polynomial or interaction terms after imputing a dataset. The third condition, ''just informed,'' includes the same variables in the missing data and analysis models, including interactions and polynomial terms. The fourth condition, ''partially informed,'' includes in the missing data model all variables in the analysis model, plus seven auxiliary variables selected because of their conceptual relevance to and correlations with analysis-model variables. The final condition, ''fully informed,'' includes an additional 20 variables correlated .2 or higher with at least one variable in the regression model. The amount of missing data in the auxiliary variables varied from 0% to 14%, with an average of around 5%. Descriptive information on the auxiliary variables, along with examples of the software code we used for the missing data models and the three NSFH datasets constructed for this study, is available in the online version of the article on Wiley InterScience, Appendices A and B.

Table 4 includes, for each sample, comparisons of the degree to which the missing data models were informed. Models reported were estimated in Stata ICE with m = 25, but SAS MI and SPSS MI results were similar. The first two conditions could not be estimated in the FIML approach because the approach always includes all analysis-model variables in the missing data estimation. Mplus also struggled to estimate the ''fully informed'' model for some datasets. The comparison condition for all models was ''just informed.'' The number of variables informing the missing data model appeared to have the greatest influence on the estimates when analysis variables were excluded from the missing data model (''way under-informed'' and ''under-informed''), although the effects were small. Both models with auxiliary variables (''partially informed'' and ''fully informed'') produced estimates similar to those obtained when only analysis-model variables were included in the imputation, particularly in the two larger datasets. The t-statistics differed slightly, but no consistent differences were found across the three datasets. Adding auxiliary variables had little effect in our example, a finding consistent with the simulation literature showing patterned effects only when the auxiliary variables are highly correlated with variables included in the model.

Table 4. Comparison of Results when the Number of Variables Informing the Missing Data Model Varied

                                Difference in b-Coefficients    Difference in t-Statistics
                                Mean^a    SD     Range          Mean^b    SD      Min      Max

How the missing data model was informed
  n = 2,000
    Way under                   0.005   0.001    0.011          −0.092   0.052   −0.455    0.257
    Under                       0.007   0.001    0.017          −0.101   0.067   −0.649    0.479
    Just (comparison group)
    Partially                   0.003   0.001    0.009           0.141   0.041   −0.002    0.542
    Fully                       0.003   0.001    0.010           0.092   0.043   −0.159    0.509
  n = 500
    Way under                   0.007   0.001    0.021          −0.109   0.075   −1.080    0.270
    Under                       0.015   0.003    0.037          −0.131   0.079   −0.817    0.538
    Just (comparison group)
    Partially                   0.007   0.001    0.014          −0.088   0.083   −1.107    0.375
    Fully                       0.009   0.003    0.037          −0.091   0.084   −1.034    0.357
  n = 250
    Way under                   0.009   0.002    0.021          −0.050   0.033   −0.290    0.211
    Under                       0.026   0.005    0.079          −0.042   0.085   −0.784    0.399
    Just (comparison group)
    Partially                   0.008   0.002    0.033           0.016   0.026   −0.181    0.211
    Fully                       0.020   0.003    0.051           0.117   0.068   −0.460    0.474

^a bdiff = Σ |b_i.MI StdX − b_i.MI(JustInformed) StdX| / ncoef
^b tdiff = Σ (|t_i.MI| − |t_i.MI(JustInformed)|) / ncoef

Question 5 Recommendations

Our findings suggest that it is important that the missing data model be at least as complete as the analysis model. Although the literature suggests the importance of including auxiliary variables (Collins et al., 2001), especially those that are highly correlated with variables in the model, our analysis with a dataset and variables commonly used by family researchers found little difference in the substantive conclusions that would be drawn with or without taking auxiliary variables into account. We believe this is an important finding because it suggests that both strategies for handling missing values in the family literature are acceptable. One strategy involves selecting from the larger dataset a subset of variables for the research problem and imputing these variables before any analysis, even though only a smaller subset is used in the completed analysis. An example of this strategy can be found in Amato, Booth, Johnson, and Rogers (2007), which included analyses with different sets of variables; however, all analyses utilized a common imputed dataset created in the manner outlined above. The alternative to this approach is to handle missing data at each step of the analysis. We believe either strategy is acceptable for most of the datasets used by family researchers.

Question 6: Should the imputation model be tailored to fit the measurement level of the variables or is use of a normal model for all variables sufficient?

Many imputation techniques are model based and make specific assumptions about the level of measurement of the variables and the covariance structure. For example, the assumption is made in some approaches (e.g., SAS MI, SPSS EM, Stata MI) that the variables are all continuous and follow a multivariate normal distribution. There is substantial evidence (Demirtas, Freels, & Yucel, 2008; Lee, 2010; Schafer & Graham, 2002) that the normal model is robust to violations of the assumption of multivariate normality. Other imputation methods (e.g., Stata ICE, SPSS MI, and IMPUTE in IVEware) can tailor the chained-regression equations used in the estimation process to the measurement level of the variables without requiring the normality assumption. To evaluate whether the process of tailoring the model to the measurement level affects the results, we compared a tailored Stata ICE model with a model in which ICE treated all variables as continuous. The comparisons of these models are presented in Table 5 as the tailored model. Six of the variables in the regression were not continuous and were estimated by multinomial or logistic regression in the tailored model (gender, though categorical, had no missing data).

Table 5. Tailoring, Rounding, and Imputing the DV, Compared to Fully Normal Model With DV Imputed

                                Difference in b-Coefficients    Difference in t-Statistics
                                Mean^a    SD     Range          Mean^b    SD      Min      Max

Tailored model
  n = 2,000                     0.016   0.005    0.069          −0.335   0.110   −1.503    0.288
  n = 500                       0.029   0.009    0.122          −0.260   0.115   −1.028    0.455
  n = 250                       0.027   0.006    0.087          −0.072   0.088   −0.710    0.483
Rounded
  n = 2,000                     0.013   0.005    0.071          −0.106   0.094   −0.865    0.479
  n = 500                       0.011   0.003    0.047          −0.095   0.087   −0.959    0.376
  n = 250                       0.026   0.009    0.149           0.007   0.051   −0.279    0.378
DV not imputed
  n = 2,000                     0.012   0.004    0.060          −0.138   0.089   −0.915    0.292
  n = 500                       0.023   0.008    0.107          −0.208   0.111   −1.126    0.419
  n = 250                       0.030   0.006    0.099          −0.018   0.088   −0.627    0.596
DV imputed, then removed
  n = 2,000                     0.002   0.000    0.003           0.029   0.018   −0.069    0.192
  n = 500                       0.003   0.000    0.006          −0.060   0.074   −1.073    0.094
  n = 250                       0.002   0.000    0.005           0.017   0.011   −0.029    0.139

^a bdiff = Σ |b_i.MI StdX − b_i.MI(FullyNormal) StdX| / ncoef
^b tdiff = Σ (|t_i.MI| − |t_i.MI(FullyNormal)|) / ncoef

Our results showed that the tailored b-coefficients were similar to the normal model estimates. The tailored model, however, produced t-statistics smaller than those found in the fully normal model, suggesting that tailoring may be less efficient. Surprisingly, when we examined individual coefficients, we found that this difference was not due to the t-statistics of the categorical variables alone. Because we do not know the true estimates, it is difficult to assess whether, in these cases, the tailored or the normal model estimates came closer to the true population values. What we can see is that tailoring imputation models may yield less power for the statistical tests than the normal model provides, at least under some conditions.

Question 6 Recommendations

The tailored approach is a reasonable strategy when practical considerations restrict the use of the normal model, particularly if the analysis model includes only one or two categorical variables. Because the normal model, in which all variables are treated as continuous, has been shown to perform quite well, even with variables that are not continuous, its use in situations with mixed levels of variables is acceptable. If the researcher's analysis plans include displaying the frequency distributions of the variables or cross-tabulations, the tailored method has the advantage of yielding imputed values that more closely reproduce the distribution of the observed values (Johnson & Young, 2009; Yucel, He, & Zaslavsky, 2008) and may be preferred.

Question 7: Should the imputed values be rounded and recoded to fit the range of the observed values?

Many imputation procedures yield implausible imputed values. These include fractional values (e.g., 1.234) when the observed values are whole numbers and values that occur outside of the observed range. A common strategy is to recode fractional values to the closest whole number (by rounding) and to recode out-of-range values to the nearest observed value within range. Although plausible values are intuitively appealing, rounding and range-adjusting may bias the estimates (Horton, Lipsitz, & Parzen, 2003). Because the imputation model assigns values that yield a ''proper'' covariance matrix, altering these values will change the covariances. Imputed values should not be viewed as real data; they serve only as the filler that facilitates a covariance-based analysis (which requires complete data to produce unbiased estimates). Therefore, the out-of-range and decimal values produced by imputation should not pose a problem. Leaving unrounded and out-of-range values in the data is nonetheless often troublesome for researchers because many practical situations require discrete values (e.g., logistic regression).
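The traditional strategy just described reduces to a one-line rule; a sketch for a hypothetical item measured on a 1 to 7 scale:

```python
# Traditional rounding and range adjustment of imputed values.

def round_and_clip(value, lo, hi):
    """Round an imputed value to the nearest whole number and pull it
    into the observed range [lo, hi]."""
    return min(max(round(value), lo), hi)

# Hypothetical imputed draws for a 1-7 scale item.
adjusted = [round_and_clip(v, 1, 7) for v in [1.234, 6.8, 9.3, -0.4]]
print(adjusted)  # [1, 7, 7, 1]
```

Every adjustment moves a value away from the one the imputation model assigned, which is exactly how the covariances can be distorted; the comparisons below gauge how much that matters in practice.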

In Table 5, we compare rounded and unrounded estimates from the three subsamples. Estimates of both the b-coefficients and t-statistics were similar, with some evidence that the rounded models were slightly less efficient in the two larger samples. Rounding made little difference in the substantive conclusions for this example, although this finding cannot be generalized to situations beyond the ones we tested. Our findings suggest that researchers who prefer or require round values should, when possible, compare rounded and unrounded versions of their models. When no differences are found, rounding may be a practical strategy, albeit ''wrong'' in some mathematical respects.

Question 7 Recommendations

For many applications, it is simple and entirely appropriate to leave the missing values as imputed. In other cases, rounding and range-adjusting the imputed values would have the advantage of making datasets more flexible for descriptive analysis, for release to other researchers, and for use in analysis methods that require discrete values. It is practical to pursue rounding strategies in these cases. In this example, we found that rounding made little difference to the substantive conclusions. Other examples, however, have shown that rounding and range-adjusting can cause trouble for descriptive and multivariate results (Horton et al., 2003). Alternative rounding strategies have been proposed by Yucel, He, and Zaslavsky (2008) and Johnson and Young (2009). These strategies should be explored when rounding is necessary but traditional rounding performs poorly (such as with highly skewed distributions).

Question 8: Should cases that are missing on the dependent variable be excluded from the analysis or should these be retained and imputed along with the other variables in the model?

Researchers are sometimes reluctant to impute values on the dependent variable because they believe that doing so would be treating cases in the analysis with unknown outcomes as though they were known. There is some evidence that, under special circumstances, excluding cases missing on the outcome variable and imputing the outcome lead to equivalent results. If the missing data are MCAR, or if there are no missing data on the independent variables and no strongly correlated auxiliary predictors (r > .5), MI cannot improve upon complete case analysis (Allison, 2001). Yet routinely dropping all cases with missing values on any of the variables treated as outcomes may lead to problems. Some statistical analyses, such as path analysis, may treat some variables as independent in one equation and dependent in another. In this situation, excluding cases with missing values on variables ever treated as outcomes may result in substantial loss in sample size, as well as possible selection bias. In general, it is not safe for researchers to ignore values missing on the dependent variable.

There is no dependent-independent variable distinction in MI algorithms. Instead, all variables in the imputation model are treated as a multivariate response. This means that all variables in the analysis model should be included for imputation, including the dependent variable (Graham, 2009; Schafer, 2003). The MI model assumes that all relevant variables are included in the imputation model. If the dependent variable is omitted from this model, the imputation will be carried out under the assumption that there is no relationship (r = 0) between the dependent variable and the independent variables. Thus, when the dependent variable is excluded from the imputation model, the relationship between the dependent variable and the independent variables becomes biased toward zero (Graham, 2009; Little & Rubin, 2002).
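A small simulation (our own, not from the article) makes this attenuation concrete: when imputed predictor values are drawn without reference to the outcome, the estimated slope shrinks toward zero, whereas drawing them from the conditional distribution given the outcome preserves it.

```python
import numpy as np

# Simulated data: the true slope of y on x is 2.0, and 40% of the
# x values are missing completely at random.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
miss = rng.random(n) < 0.4
obs = ~miss

def slope(xv, yv):
    """OLS slope of yv regressed on xv."""
    return np.cov(xv, yv)[0, 1] / np.var(xv, ddof=1)

# Imputation model WITHOUT the outcome: draw imputed x from its observed
# marginal distribution, so imputed values carry no information about y.
x_wo = x.copy()
x_wo[miss] = rng.choice(x[obs], size=miss.sum())

# Imputation model WITH the outcome: stochastic regression imputation,
# drawing imputed x from its conditional distribution given y
# (means are ~0 here, so the intercept is omitted).
b = np.cov(x[obs], y[obs])[0, 1] / np.var(y[obs], ddof=1)
resid_sd = np.std(x[obs] - b * y[obs], ddof=1)
x_w = x.copy()
x_w[miss] = b * y[miss] + rng.normal(scale=resid_sd, size=miss.sum())

# The first slope is attenuated toward zero; the second stays near 2.0.
print(slope(x_wo, y), slope(x_w, y))
```

The attenuation appears because the y-blind draws contribute variance to x without contributing any covariance with y.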

Dependent-independent variable distinctions should clearly be left to post-imputation analysis, but it is less obvious what to do with imputed values during analysis. Von Hippel (2007) has suggested that the dependent variable be imputed to keep the imputation model specified correctly, but that cases with imputed values on the dependent variable be removed before analysis (multiple imputation then deletion, or MID). Under extreme circumstances, MID may offer improved efficiency and protection against problematic imputed values on the dependent variable (von Hippel, 2007). The MID method may be most attractive when there are extreme levels of missingness on the dependent variable, say 20% to 50% (von Hippel, 2007). With more commonly observed levels of missingness, such as 5% to 10% (5% in the NSFH datasets we use here), the MID method may not offer a discernible advantage.

We compared two strategies for conducting MI when values were missing on the dependent variable, shown in the bottom half of Table 5. In the first situation, the dependent variable was not imputed (nor was it included in the imputation model as auxiliary information). In the second situation, the dependent variable was imputed and imputed values were removed for analysis (von Hippel, 2007). Each of these models was compared with a model where the dependent variable was imputed and the imputed values were retained in the analysis. As expected, assuming the correlation between the dependent variable and the independent variables was zero in the imputation model systematically biased the analysis coefficients downward. The MID method produced nearly identical b-coefficients to the model where the imputed values were retained in the analysis. Surprisingly, we also did not observe the expected consistent gains in efficiency. This may be because we used 25 imputed datasets. Simulations by von Hippel (2007) found that differences between MID and retaining imputed values greatly decreased with a greater number of imputed datasets, even with high levels of missingness.
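The MID workflow can be sketched as follows. This is a hedged, single-imputation illustration on simulated data using scikit-learn's IterativeImputer (not the authors' software); in practice the impute-then-delete step would be repeated for each of the m imputed datasets and the results pooled.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Toy data (ours, not the NSFH): y depends on x1 and x2, with roughly
# 10% of the y values missing.
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(size=n)
y[rng.random(n) < 0.10] = np.nan
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

# Step 1: impute with the dependent variable INCLUDED in the
# imputation model, so the model stays correctly specified.
missing_y = df["y"].isna()
imputed = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(df), columns=df.columns
)

# Step 2 (the "deletion" in MID): drop cases whose outcome was
# originally missing, then fit the analysis model on the rest.
analysis = imputed.loc[~missing_y]
fit = LinearRegression().fit(analysis[["x1", "x2"]], analysis["y"])
print(fit.coef_)
```

With missing values only on y, as here, MID reduces to complete-case analysis; its benefit appears when the predictors also have missing values that the imputation fills in.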

Question 8 Recommendation

Excluding from the analysis cases that have missing values on the dependent variable is acceptable only under certain circumstances. Even in these situations, imputing the outcome does no harm and yields perfectly acceptable estimates when there are good auxiliary variables in the imputation model. Imputing the dependent variable is a general procedure that creates a dataset that can be applied in a broad range of models, so we recommend it for most situations and especially when the research focuses on more than one outcome variable. When constructing the imputation model, the dependent variable must be included, even if imputed values are later excluded by the MID procedure. When there is a high percentage of missingness on the dependent variable, we recommend using at least 25 imputed datasets.

Question 9: If a greater proportion of the data were missing than is typically found in the analysis of survey data, would the same recommendations apply?

In the example dataset here, almost half of the cases had at least one missing value on the variables in the regression, but most missing values came from the income and frequency-of-coitus variables. There are situations where the missing data may be more evenly distributed among the variables, even with all cases having at least one missing value. This situation may occur when combining datasets (e.g., the General Social Survey cumulative dataset, which combines many cross-sectional survey years), when using planned missing designs (Graham, Taylor, Olchowski, & Cumsille, 2006; Johnson, Roth, & Young, 2011), or when using a dataset that contains both a first interview including some variables and a mailed-back questionnaire with other variables returned by only some of the respondents (e.g., the National Survey of Midlife Development in the United States). In such cases, modern methods may have even greater advantages over casewise deletion, as the difference in sample size may be much larger.

FIML and MI methods perform well even when the proportion missing is substantially higher than in our example. Many simulation studies test missing data approaches with 50% or more missing values on variables in the model (e.g., Allison, 2001; Collins et al., 2001). Nevertheless, large amounts of missing values can increase the chances that the imputation software encounters problems imputing the data; FIML models often have estimation problems in such cases. Of course, when the proportion missing is very large, standard errors are likely to be high because less information is available and, unless the missingness was planned, the likelihood that the missing at random (MAR) assumption is violated increases.

Question 9 Recommendation

There are no clear guidelines about the amount of missing information that would preclude use of the modern methods described here. When there is a substantial amount of missing data, the researcher should look closely at the reasons why the data may be missing and consider the plausibility of the MAR assumption. Large amounts of missing data may signal serious design flaws or problems in the data collection process, raising concerns about the validity of the data regardless of the method used to handle the missingness. Generally, the best way to handle missing data is to design the data collection process to minimize its occurrence. When the researcher has no control over the amount of missing information, we recommend using modern methods, even in situations with larger amounts of missing data than we tested here, because MI and FIML methods are always at least as good as the traditional procedures (Graham, 2009). The options selected for the missing data model, such as the number of datasets generated and auxiliary variables used, are likely to have greater effects on the estimates as the proportion of missing values becomes larger. When many values are missing, good science dictates more rigorous testing of alternate model specifications and transparency of methods in reporting results.

Question 10: What information on the procedure used to account for missing data should be included in the research report?

Although the modern methods compared here generally yield similar results, it is important that sufficient detail be presented in the research report so that the reader can understand the specific choices made in configuring the missing data model. Our results, along with other studies, suggest that the three factors most likely to affect the performance of the specific modern procedure selected to account for missing data are: (a) the amount of missing data, (b) the number of datasets generated, and (c) the variables used to inform the missing data model.

Question 10 Recommendation

The details of the research report should include the method and software used, the type of missing data model used, the number of imputations run (if MI), the set of variables used to inform the missing data model, and the amount of missing data in each variable. The latter can be included in a table with the descriptive information on each variable or, if only one or two variables have higher rates of missing values, these could be identified in a statement indicating a threshold of missingness for the other variables. An example of a statement to be included in the methods section would be: "Missing data were imputed using ICE in Stata. Twenty-five datasets were generated and the imputation was informed by the variables in the analysis (including interaction terms). Estimates from the 25 datasets were pooled with the MIM prefix in Stata."

DISCUSSION

We began this research project with the intent of developing a set of recommendations or "best practices" for the family researcher who must analyze incomplete data. Our primary strategy was to assess, for a survey dataset widely used among family researchers, the sensitivity of the findings from a regression analysis to several possible approaches to account for missing data. We applied several options in four software applications designed to prepare the data to allow full use of complete and incomplete cases. Some differences were found in the estimates that are attributable to the missing data methods. An important, but not surprising, finding was that traditional methods of casewise deletion, mean substitution, and the use of a missing data indicator each yielded results that raised concerns about potential biases likely to occur when employing these methods.

More importantly, we found that the modern methods applied with many different options yielded findings that led to nearly identical substantive interpretations and hypothesis test results. There were differences, certainly, and we discussed many of these, but most salient were the similarities. In our example, a researcher who used only the analysis model variables to inform an imputation and produced only 5 datasets would have reached the same substantive conclusions as a researcher who informed an imputation with 40 auxiliary variables and imputed 25 datasets.

We tested these approaches with only one regression model, fit to three subsamples of one survey dataset; it is possible that other types of analysis models with different outcome and predictor variables would have revealed more differences. For example, the effects of rounding might have been greater if our model had included more categorical independent variables. The results might also have differed with a binary outcome variable or if we had used clustered or weighted adjustments. Finally, most of the methods used here assumed the data were missing at random (MAR). Although simulation studies have shown that the MI and FIML techniques are quite robust to violation of this assumption (e.g., if income nonresponse were highly correlated with income level itself), it is possible that all estimates, although consistent across methods, were wrong. Because tests for the impact of data not missing at random (NMAR) are complex or are unavailable with observed data, this potential bias is best explored in simulations.

We believe that our approach of using observed data to compare models has some real advantages over simulation studies. Although strong inferences can be made from simulations, which detect bias with great efficiency, biases are often observed only when the distribution and pattern of missing data are quite different from what is typically found in family-related datasets. Simulation results must be closely examined to assess the conditions necessary to produce enough bias to alter substantive conclusions. Our findings do not contradict previous studies; rather, we show that the data patterns and distribution of missing values that typically occur in family researchers' datasets seldom reach the threshold where substantive conclusions are affected.

Other issues confronting the researcher whose data contain missing values were not explored because they could not be adequately evaluated with the empirical data and models used in this study. One issue is whether inapplicable or "don't know" responses can be imputed for inclusion in the analysis. Although this is an important consideration for some data, the consequences of imputing such responses or treating them in other ways have not been sufficiently explored (Kroh, 2006). A second issue relates to imputing missing values in panel (longitudinal) studies. The methods described here can be used across waves of panel data, and clear discussion is already available in the literature regarding how this can be carried out (Allison, 2001).

Based on our findings, the steps that researchers might follow in handling missing data in an analysis are as follows. First, they should become familiar with the variables they plan to use, the amount of missing data present, reasons the missing data occurred, and patterns of missingness. They should then delineate the types of analyses they are likely to use and features of the data that might restrict the analytical models, such as the need to use weights and to adjust for clustering. If the analysis plans involve only a small number of continuous variables or a limited set of regression-type models, then a FIML method might be a preferred approach because it eliminates the extra effort required to impute. When the analysis strategy is incompatible with a FIML approach, when multiple researchers will be analyzing the same data, or when different analysis models need to be mutually consistent, an MI approach may be preferred.

When an MI strategy is used, our results suggest that researchers have considerable flexibility in the choice of imputation model. If only a small amount of data is missing (e.g., less than 1% or 2% in any variable), varying the imputation model specifications is unlikely to result in different substantive conclusions. With more missing data present, 25+ imputed datasets are suggested. If there is a need to impute a relatively large number of variables (e.g., 50 or more), the researcher should be prepared for the imputation process to take hours of computer time and to require some troubleshooting and adjustment of the parameters of the imputation process before it is successfully completed. Much of the preliminary analysis could be conducted in only one of these imputed datasets, with the final models replicated in each and combined with a flexible procedure such as MIM in Stata, or combined manually in a spreadsheet. If part of the analysis plan involves descriptive data, such as percentage tables, or if a number of categorical variables are used in the analyses as outcomes, then rounding and recoding the imputed data into the ranges of the observed variables may be considered.
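When the final models are combined manually rather than with a prefix command such as MIM, the combining follows Rubin's (1987) rules: average the coefficients across datasets, and total the within- and between-imputation variances. A minimal Python sketch (our own helper; the input numbers are hypothetical):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    estimates: length-m array of the coefficient from each imputed dataset
    variances: length-m array of its squared standard error in each dataset
    Returns (pooled estimate, pooled standard error).
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()          # pooled point estimate
    w_bar = variances.mean()          # within-imputation variance
    b = estimates.var(ddof=1)         # between-imputation variance
    t = w_bar + (1 + 1 / m) * b       # total variance
    return q_bar, np.sqrt(t)

# Example: the same slope estimated in m = 5 imputed datasets
est, se = pool_rubin([0.48, 0.52, 0.50, 0.47, 0.53],
                     [0.010, 0.011, 0.009, 0.010, 0.012])
print(round(est, 3), round(se, 3))  # 0.5 0.106
```

The pooled standard error exceeds the average within-dataset standard error because the between-imputation term carries the uncertainty added by the missing data.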

We believe that the choice of a missing data strategy should be based on practical issues rather than on the need to fit the data and the analysis plans into a specific, pre-selected missing data procedure. Our results show that this is a reasonable approach to the type of data family researchers are likely to use.

NOTE

An earlier version of this article was presented at the 2008 Annual Meeting of the National Council on Family Relations Pre-Conference Theory Construction and Research Methodology Workshop, Little Rock, Arkansas. We are grateful to Alan C. Acock and Daniel W. Russell for their valuable feedback and to Patrick Royston for helpful consultation. We thank Laura Rosell and Claire Altman for their assistance during manuscript preparation. This research was partially supported by NICHD grant R01 HD044144 (David R. Johnson, PI).

SUPPORTING INFORMATION

Additional supporting information may be found in the online version of this article:

Appendix A Table of Auxiliary Variables (used to inform the imputation model but not included in the analysis model).

Appendix B Syntax for Multiple Imputation in Different Software Programs.

Appendix C Description of Files

Please note: Wiley-Blackwell is not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.

REFERENCES

Acock, A. C. (1997). Working with missing data. Family Science Review, 10, 76-102.

Acock, A. C. (2005). Working with missing values. Journal of Marriage and Family, 67, 1012-1028.

Allison, P. D. (2001). Missing data. Thousand Oaks, CA: Sage.

Amato, P., Booth, A., Johnson, D. R., & Rogers, S. (2007). Alone together: How marriage in America is changing. Cambridge, MA: Harvard University Press.

Bauldry, S. (2010, August). A simulation of the value of auxiliary variables with a direct maximum likelihood estimator. Paper presented at the American Sociological Association Annual Meeting, Atlanta, GA.

Carlin, J. B., Galati, J. C., & Royston, P. (2008). A new framework for managing and analyzing multiply imputed data in Stata. The Stata Journal, 8, 49-67.

Cohen, J., & Cohen, P. (1985). Applied multiple regression and correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Collins, L. M., Schafer, J. L., & Kam, C.-M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.

Demirtas, H., Freels, S. A., & Yucel, R. M. (2008). Plausibility of multivariate normality assumption when multiply imputing non-Gaussian continuous outcomes: A simulation assessment. Journal of Statistical Computation and Simulation, 78, 69-84.

Enders, C. K. (2006). A primer on the use of modern missing-data methods in psychosomatic medicine research. Psychosomatic Medicine, 68, 427-436.

Enders, C. K. (2010). A note on the use of missing auxiliary variables in full information maximum likelihood-based structural equation models. Structural Equation Modeling, 15, 434-448.

Enders, C. K., & Peugh, J. L. (2004). Using an EM covariance matrix to estimate structural equation models with missing data: Choosing an adjusted sample size to improve the accuracy of inferences. Structural Equation Modeling, 11, 1-19.

Graham, J. W. (2003). Adding missing-data-relevant variables to FIML-based structural equation models. Structural Equation Modeling, 10, 80-100.

Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549-576.

Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8, 206-213.

Graham, J. W., Taylor, B. J., Olchowski, A. E., & Cumsille, P. E. (2006). Planned missing data designs in psychological research. Psychological Methods, 11, 323-343.

Horton, N. J., Lipsitz, S. R., & Parzen, M. (2003). A potential for bias when rounding in multiple imputation. The American Statistician, 57, 229-232.

Howell, D. C. (2008). The treatment of missing data. In W. Outhwaite & S. Turner (Eds.), Handbook of social science methodology. London: Sage.

Johnson, D. R., Roth, V., & Young, R. (2011, April). Planned missing data designs in health surveys. Paper presented at the 10th Conference on Health Survey Research Methods, Peachtree City, GA.

Johnson, D. R., & Young, R. (2009). Improving the utility of imputed values in survey datasets. JSM Proceedings, Statistical Computing Section. Alexandria, VA: American Statistical Association.

Kenward, M. G., & Carpenter, J. (2007). Multiple imputation: Current perspectives. Statistical Methods in Medical Research, 16, 199-218.

Kroh, M. (2006). Taking "don't knows" as valid responses: A multiple complete random imputation of missing data. Quality & Quantity, 40, 225-244.


Lee, K. J. (2010). Multiple imputation for missing data: Fully conditional specification versus multivariate normal imputation. American Journal of Epidemiology, 171, 1-9.

Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data. Hoboken, NJ: John Wiley & Sons.

Meng, X. L. (2001). A congenial overview and investigation of multiple imputation inferences under uncongeniality. In R. M. Groves, D. A. Dillman, J. L. Eltinge, & R. J. A. Little (Eds.), Survey nonresponse. New York: Wiley.

Muthen, L., & Muthen, B. (2010). Mplus: User's guide (6th ed.). Los Angeles: Muthen & Muthen.

Myung, J. (2003). Tutorial on maximum likelihood estimation. Journal of Mathematical Psychology, 47, 90-100.

Raghunathan, T. E. (2004). What do we do with missing data? Some options for analysis of incomplete data. Annual Review of Public Health, 25, 99-117.

Robins, J. M., & Wang, N. (2000). Inference for imputation estimators. Biometrika, 87, 113-124.

Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: John Wiley & Sons.

Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91, 473-489.

Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.

Schafer, J. L. (2003). Multiple imputation in multivariate problems when the imputation and analysis models differ. Statistica Neerlandica, 57, 19-35.

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177.

von Hippel, P. (2004). Biases in SPSS 12.0 missing values analysis. The American Statistician, 58, 93-108.

von Hippel, P. (2007). Regression with missing Ys: An improved strategy for analyzing multiply imputed data. Sociological Methodology, 83-118.

von Hippel, P. (2009). How to impute interactions, squares, and other transformed variables. Sociological Methodology, 265-291.

Yoo, J. E. (2009). The effect of auxiliary variables and multiple imputation on parameter estimation in confirmatory factor analysis. Educational and Psychological Measurement, 69, 929-947.

Yucel, R. M., He, Y., & Zaslavsky, A. M. (2008). Using calibration to improve rounding in imputation. The American Statistician, 62, 125-129.

Yucel, R. M., Schenker, N., & Raghunathan, T. E. (2007). Sequential hierarchical regression imputation (SHRIMP). Unpublished manuscript, University of Massachusetts Amherst.