
1

Handling Missing Data

Estie Hudes and Tor Neilands

UCSF Center for AIDS Prevention Studies
March 16, 2007

2

Presentation Overview

Overview of concepts and approaches to handling missing data

Missing data mechanisms - how data came to be missing

Problems with popular ad-hoc missing data handling methods

A more modern, better approach: Maximum likelihood (FIML/Direct ML)

More on modern approaches: the EM algorithm

Another modern approach: Multiple Imputation (MI)

Extensions and Conclusions

3

Types of Missing Data

Item-missing: respondent is retained in the study, but does not answer all questions

Wave-missing: respondent is observed at intermittent waves

Drop-out: respondent ceases participation and is never observed again

Combinations of the above

4

Methods of Handling Missing Data

First method: Prevention of missing cases (e.g., loss to follow-up) and individual item non-response

Second method: Ad-hoc approaches (e.g., listwise/casewise deletion)

Third method: Maximum likelihood-based approaches (e.g., direct ML) and related approaches (e.g., restricted ML)

5

Prevention of Missing Data

Minimize individual item non-response:
CASI and A-CASI may prove helpful
Interviewer-administered surveys
Avoid self-administered surveys where possible

Minimize loss to follow-up in longitudinal studies by incorporating good participant tracking protocols, appropriate use of incentives, and reducing respondent burden

6

Ad-hoc Approaches to Handling Missing Data

Listwise deletion (a.k.a. complete-case analysis)

Pairwise deletion (a.k.a. available-case analysis)

Dummy variable adjustment (Cohen & Cohen)

Single imputation:
Replacement with variable or participant means
Regression
Hot deck

7

Modern Approaches of Handling Missing Data

Maximum likelihood (FIML/direct ML)

EM algorithm

Multiple imputation (MI)

Selection models and pattern-mixture models for non-ignorable data

Weighting

We will confine our discussion to Direct ML, the EM algorithm, and Multiple Imputation

8

A Tour of Missing Data Mechanisms

How did the data become incomplete or missing?

Missing Completely at Random (MCAR)

Missing at Random (MAR)

Not Missing at Random (NMAR; non-ignorable missingness; informative missingness)

Influential article: Rubin (1976) in Biometrika

9

Missing Data Mechanisms: Missing Completely at Random

Pr(Y is missing | X, Y) = Pr(Y is missing)

If incomplete data are MCAR, the cases with complete data are then a random subset of the original sample.

A good situation to be in if you have missing data, because listwise deletion of the cases with incomplete data is generally justified.

A downside is loss of statistical power, especially if there are many cases and the number of cases with complete data is a small fraction of the original number of cases.

10

Missing Data Mechanisms: Missing at Random

Pr(Y is missing | X, Y) = Pr(Y is missing | X)

Within each level of X, the probability that Y is missing does not depend on the numerical value of Y.

Data are MCAR within each level of X.

MAR is a much less restrictive assumption than MCAR.

11

Missing Data Mechanisms: Not Missing at Random

If incomplete data are neither MCAR nor MAR, the data are considered NMAR or non-ignorable.

Missing data mechanism must be modeled to obtain good parameter estimates.

Heckman’s selection model is one example of NMAR modeling. Pattern mixture models are another NMAR approach.

Disadvantages of NMAR modeling: Requires high level of knowledge about missingness mechanism; results often highly sensitive to the choice of NMAR model selected.

12

Missing Data Mechanisms: Examples (1)

Scenario: Measuring systolic blood pressure (SBP) in January and February (Schafer and Graham, 2002, Psychological Methods, 7(2), 147-177)

MCAR: Data missing in February at random, unrelated to SBP level in January or February or any other variable in the study - missing cases are a random subset of the original sample's cases.

MAR: Data missing in February because the January measurement did not exceed 140 - cases are randomly missing data within the two groups: SBP > 140 and SBP <= 140.

NMAR: Data missing in February because the February SBP measurement did not exceed 140. (SBP taken, but not recorded if it is <= 140.) Cases' data are not missing at random.
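The three mechanisms in the SBP scenario can be illustrated with a short simulation (a sketch only; the means, SDs, and missingness rates below are made-up illustration values, not taken from Schafer and Graham):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
jan = rng.normal(130, 15, n)                # January SBP (illustrative values)
feb = 0.8 * jan + 26 + rng.normal(0, 8, n)  # February SBP, correlated with January

mcar = rng.random(n) < 0.3    # MCAR: 30% of February readings lost at random
mar = jan <= 140              # MAR: February missing when January <= 140
nmar = feb <= 140             # NMAR: February missing when February itself <= 140

# Complete-case means of February SBP: unbiased only under MCAR;
# under MAR and NMAR the observed cases over-represent high-SBP people
print(round(feb.mean(), 1), round(feb[~mcar].mean(), 1),
      round(feb[~mar].mean(), 1), round(feb[~nmar].mean(), 1))
```

Comparing the printed means shows why the mechanism matters: dropping incomplete cases distorts the February mean whenever missingness is tied to SBP.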

13

Missing Data Mechanisms: Examples (2 )

Scenario: Measuring Body Mass Index (BMI) of ambulance drivers in a longitudinal context (Heitjan, 1997, AJPH, 87(4), 548-550).

MCAR: Data missing at follow-up because participants were out on call at the time of scheduled measurement, i.e., the reason for data missingness is unrelated to the outcome or other measured variables - missing cases are a random subset of the population of all cases.

MAR: Data missing at follow-up because of high BMI and embarrassment at the initial visit, regardless of whether the participant gained or lost weight since baseline, i.e., the reason for data missingness is related to BMI, a measured variable in the study.

NMAR: Data missing at follow-up because of weight gain since the last visit (assuming weight gain is unrelated to other measured variables in the study).

14

More on Missing Data Mechanisms

Ignorable data missingness - occurs when data are incomplete due to an MCAR or MAR process

If incomplete data arise from an MCAR or MAR data missingness mechanism, there is no need for the analyst to explicitly model the missing data mechanism (in the likelihood function), as long as the analyst uses software programs that take the missingness mechanism into account internally (several of these will be mentioned later)

Even if data missingness is not fully MAR, methods that assume MAR usually (though not always) offer lower expected parameter estimate bias than methods that assume MCAR (Muthén, Kaplan, & Hollis, Psychometrika, 1987).

15

Ad-hoc Methods Unraveled (1)

Listwise deletion: delete all cases with a missing value on any of the variables in the analysis. Only use complete cases.

OK if missing data are MCAR:
Parameter estimates unbiased
Standard errors appropriate
But can result in substantial loss of statistical power

Biased parameter estimates if data are MAR

Robust to NMAR for predictor variables
Robust to NMAR for predictor variables OR the outcome variable in logistic regression models (slopes only)

16

Ad-hoc Methods Unraveled (2)

Pairwise deletion: use all available cases for computation of any sample moment:
For computation of means, use all available data for each variable
For computation of covariances, use all available data on pairs of variables

Can lead to non-positive definite var-cov matrices because it uses different pairs of cases for each entry.

Can lead to biased standard errors under MAR.

17

Ad-hoc Methods Unraveled (3)

Dummy variable adjustment
Advocated by Cohen & Cohen (1985)

1. When X has missing values, create a dummy variable D to indicate complete case versus case with missing data.
2. When X is missing, fill in a constant c.
3. Regress Y on X and D (and other non-missing predictors).

Produces biased coefficient estimates (see Jones' 1996 JASA article)

18

Ad-hoc Methods Unraveled (4)

Single imputation (of missing values):
Mean substitution - by variable or by observation
Regression imputation (i.e., replacement with conditional means)
Hot deck: pick "donor" cases within homogeneous strata of observed data to provide data for cases with unobserved values

These methods lead to biased parameter estimates (e.g., means, regression coefficients) and variance and standard error estimates that are biased downwards. One exception: Rubin (1987) provides a hot-deck based method of multiple imputation that may return unbiased parameter estimates under MAR.

Otherwise, these methods are not recommended.
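The downward bias in variability from mean substitution is easy to see in a small simulation (an illustrative sketch; the normal distribution and the 30% MCAR rate are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(50, 10, 1000)               # complete data
y_miss = y.copy()
y_miss[rng.random(1000) < 0.3] = np.nan    # ~30% of values deleted MCAR

# Mean substitution: fill every missing value with the observed mean
filled = np.where(np.isnan(y_miss), np.nanmean(y_miss), y_miss)

# The mean is preserved exactly, but the standard deviation shrinks,
# so downstream standard errors and test statistics are too optimistic
print(np.nanstd(y_miss), filled.std())
```

Because roughly a third of the values sit exactly at the mean after filling, the estimated spread of the variable is visibly understated.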

19

Modern Methods: Maximum Likelihood (1)

When there are no missing data:
Uses the likelihood function to express the probability of the observed data, given the parameters, as a function of the unknown parameter values.

Example: L(θ) = ∏ᵢ₌₁ⁿ p(xᵢ, yᵢ | θ), where p(x,y|θ) is the (joint) probability of observing (x,y) given a parameter θ, for a sample of n independent observations. The likelihood function is the product of the separate contributions to the likelihood from each observation.

MLEs are the values of the parameters which maximize the probability of the observed data (the likelihood).

20

Modern Methods: Maximum Likelihood (2)

Under ordinary conditions, ML estimates are:
consistent (approximately unbiased in large samples)
asymptotically efficient (have the smallest possible variance)
asymptotically normal (one can use normal theory to construct confidence intervals and p-values)

The ML approach can be easily extended to MAR situations:
The contribution to the likelihood from an observation with X missing is the marginal: g(yⱼ|θ) = Σₓ p(x, yⱼ|θ)
This likelihood may be maximized like any other likelihood function. Often labeled FIML or direct ML.

L(θ) = ∏ᵢ₌₁ᵐ p(xᵢ, yᵢ|θ) · ∏ⱼ₌ₘ₊₁ⁿ g(yⱼ|θ)

21

Modern Methods: Maximum Likelihood (3)

Available software to perform FIML estimation:

AMOS - Analysis of Moment Structures
Commercial program licensed as part of SPSS (CAPS has a 10-user license for this product)
Fits a wide variety of univariate and multivariate linear regression, ANOVA, ANCOVA, and structural equation (SEM) models.
http://www.smallwaters.com

Mx - similar to AMOS in capabilities, less user-friendly
Freeware: http://views.vcu.edu/mx

LISREL - similar to AMOS, more features, less user-friendly
Commercial program: http://www.ssicentral.com

22

Modern Methods: Maximum Likelihood (4)

Available software:

lEM - Loglinear & Event history analysis with Missing data (Jeroen Vermunt)
Freeware DOS program downloadable from the Internet
http://www.uvt.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html
Fits log-linear, logit, latent class, and event history models with categorical predictors.

Mplus
Similar capabilities to AMOS (commercial)
Less easy to use than AMOS, but more general modeling features.
http://www.statmodel.com

23

Modern Methods: Maximum Likelihood (5)

Longitudinal data analysis software options (not discussed):

Normally distributed outcomes
SAS PROC MIXED
S-PLUS LME
Stata XTREG, XTREGAR, and XTMIXED

Poisson
Stata XTPOIS

Negative binomial
Stata XTNBREG

Logistic
Stata XTLOGIT

24

Modern Methods: Maximum Likelihood (6)

Software for longitudinal analyses (continued):

General modeling of clustered and longitudinal data
Stata GLLAMM add-on command
SAS PROC NLMIXED
S-PLUS NLME

What about Generalized Estimating Equations (GEE) for analysis of longitudinal or clustered data with missing observations?

Assumes incomplete data are MCAR. See Hedeker & Gibbons, 1997, Psychological Methods, p. 65, and Heitjan, AJPH, 1997, 87(4), 548-550.

Can be extended to accommodate the MAR assumption via a weighting approach developed by Robins, Rotnitzky, & Zhao (JASA, 1995), but it has limited applicability.

25

Maximum Likelihood Example (1) 2 x 2 Table with missing data

                   Vote (Y=V)
Sex (X=S)     Yes    No    Missing   (complete)
Male           28    45      10        (73)       p11  p12
Female         22    52      15        (74)       p21  p22
Total          50    97      25       (147)       (sums to 1)

Likelihood function:
L(p11, p12, p21, p22) = p11^28 · p12^45 · p21^22 · p22^52 · (p11 + p12)^10 · (p21 + p22)^15

26

Maximum Likelihood Example (2) 2 x 2 Table with missing data

p̂11 = (28/73) × ((73 + 10)/172) = 0.1851

p̂12 = (45/73) × ((73 + 10)/172) = 0.2975

p̂21 = (22/74) × ((74 + 15)/172) = 0.1538

p̂22 = (52/74) × ((74 + 15)/172) = 0.3636

27

Maximum Likelihood Example (3)

Using lEM for the 2 x 2 Table

Input (partial):

* R = response (NM) indicator
* S = sex; V = vote
man 2        * 2 manifest variables
res 1        * 1 response indicator
dim 2 2 2    * with two levels
lab R S V    * and label
sub SV S     * defines these two subgroups
mod SV       * model for complete data
dat [28 45 22 52   * subgroup SV
     10 15]        * subgroup S

Output (partial):

*** (CONDITIONAL) PROBABILITIES ***

* P(SV) *            complete data only
1 1   0.1851 (0.0311)   0.1905 (0.0324)
1 2   0.2975 (0.0361)   0.3061 (0.0380)
2 1   0.1538 (0.0297)   0.1497 (0.0294)
2 2   0.3636 (0.0384)   0.3537 (0.0394)

* P(R) *
1   0.8547
2   0.1453

28

Maximum Likelihood Example (1) Continuous outcome & multiple predictors

Data on American colleges and universities through US News and World Report

N = 1302 colleges

Available from http://lib.stat.cmu.edu/datasets/colleges

Described on p. 21 of Allison (2001)

29

Maximum Likelihood Example (2) Continuous outcome & multiple predictors

Outcome: gradrat - graduation rate (1,204 non-missing cases)

Predictors:
csat - combined average scores on verbal and math SAT (779 non-missing cases)
lenroll - natural log of the number of enrolling freshmen (1,297 non-missing cases)
private - 1 = private; 0 = public (1,302 non-missing cases)
stufac - ratio of students to faculty (x 100; 1,300 non-missing cases)
rmbrd - total annual cost of room and board (thousands of dollars; 1,300 non-missing cases)
act - mean ACT scores (714 non-missing cases)

30

Maximum Likelihood Example (3) Continuous outcome & multiple predictors

Predict graduation rate from:
Combined SAT
Number of enrolling freshmen on log scale
Student-faculty ratio
Private or public institution classification
Room and board costs

Use a linear regression model

ACT score included as an auxiliary variable

Use AMOS and Mplus to illustrate direct ML

31

Maximum Likelihood Example (4) Continuous outcome & multiple predictors

AMOS: Two methods for model specification
Graphical user interface
AMOS BASIC programming language

Results (assuming joint MVN)

Regression Weights:
                     Estimate    S.E.      C.R.       P
GradRat <-- CSAT       0.0669   0.0048   13.9488   0.0000
GradRat <-- LEnroll    2.0832   0.5953    3.4995   0.0005
GradRat <-- StuFac    -0.1814   0.0922   -1.9678   0.0491
GradRat <-- Private   12.9144   1.2769   10.1142   0.0000
GradRat <-- RMBRD      2.4040   0.5481    4.3856   0.0000

32

Maximum Likelihood Example (5) Continuous outcome & multiple predictors

Mplus example (assuming joint MVN)

INPUT INSTRUCTIONS
TITLE: P. Allison 6/2002 Oakland, CA Missing Data Workshop non-normal example
DATA: FILE IS D:\My Documents\Papers\Allison-Paul\usnews.txt;
VARIABLE: NAMES ARE csat act stufac gradrat rmbrd private lenroll;
  USEVARIABLES ARE csat act stufac gradrat rmbrd private lenroll;
  MISSING ARE ALL . ;
ANALYSIS: TYPE = general missing h1 ;
  ESTIMATOR = ML ;
MODEL: gradrat ON csat lenroll stufac private rmbrd ;
  gradrat WITH act ;
  csat WITH lenroll stufac private rmbrd act ;
  lenroll WITH stufac private rmbrd act ;
  stufac WITH private rmbrd act ;
  private WITH rmbrd act ;
  rmbrd WITH act ;
OUTPUT: patterns ;

33

Maximum Likelihood Example (6) Continuous outcome & multiple predictors

Mplus results (assuming joint MVN)

MODEL RESULTS
              Estimates    S.E.   Est./S.E.
GRADRAT ON
  CSAT           0.067    0.005     13.954
  LENROLL        2.083    0.595      3.501
  STUFAC        -0.181    0.092     -1.969
  PRIVATE       12.914    1.276     10.118
  RMBRD          2.404    0.548      4.387

34

Maximum Likelihood Example (7)Continuous outcome & multiple predictors

Mplus example for continuous, non-normal data:
Uses sandwich estimator robust to non-normality
Specify MLR instead of ML as the estimator
Mplus MLR estimator assumes MCAR missingness and finite fourth-order moments (i.e., finite kurtosis); initial simulation studies show low bias with MAR data

              Estimates    S.E.   Est./S.E.
GRADRAT ON
  CSAT           0.067    0.005     13.312
  LENROLL        2.083    0.676      3.083
  STUFAC        -0.181    0.093     -1.950
  PRIVATE       12.914    1.327      9.735
  RMBRD          2.404    0.570      4.215

35

Maximum Likelihood Summary

ML advantages:
Provides a single, deterministic set of results appropriate under MAR data missingness.
Well-accepted method for handling missing values (e.g., for grant writing).
Generally fast and convenient.

ML disadvantages:
Parametric: may not always be robust to violations of distributional assumptions (e.g., multivariate normality).
Only available for some models via canned software (would need to program other models).
Most readily available for continuous outcomes and ordered categorical outcomes.
Available for Poisson or Cox regression with continuous predictors in Mplus, but requires numerical integration, which is time-consuming and can be challenging to use, especially with large numbers of variables.

36

Modern Methods: EM Algorithm (1)

EM algorithm proceeds in two steps to generate ML estimates for incomplete data: Expectation and Maximization. The steps alternate iteratively until convergence is attained.

Seminal article by Dempster, Laird, & Rubin (1977), Journal of the Royal Statistical Society, Series B, 39, 1-38. Early treatment by H.O. Hartley (1958), Biometrics, 14(2), 174-194.

Goal is to estimate sufficient statistics that can then be used for substantive analyses. In normal theory applications these would be the means, variances and covariances of the variables (first and second moments of the normal distributions of the variables).

Example from Allison, pp. 19-20: For a normal theory regression scenario, consider four variables X1 - X4 that have some missing data on X3 and X4.

37

Modern Methods: EM Algorithm (2)

Starting Step (0):
Generate starting values for the means and covariance matrix. Can use the usual formulas with listwise or pairwise deletion.
Use these values to calculate the linear regression of X3 on X1 and X2. Similarly for X4.

Expectation Step (1):
Use the linear regression coefficients and the observed data for X1 and X2 to generate imputed values of X3 and X4.

38

Modern Methods: EM Algorithm (3)

Maximization Step (2):
Use the newly imputed data along with the original data to compute new estimates of the sufficient statistics (e.g., means, variances, and covariances).
Use the usual formula to compute the mean.
Use modified formulas to compute variances and covariances that correct for the usual underestimation of variances that occurs in single imputation approaches.

Cycle through the expectation and maximization steps until convergence is attained (sufficient statistic values change only negligibly from one iteration to the next).
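The alternating steps above can be sketched for the simplest case: a bivariate normal where x is fully observed and some y values are missing. This is an illustrative sketch (not any package's implementation) and assumes MAR missingness and joint normality:

```python
import numpy as np

def em_bivariate_normal(x, y, max_iter=200, tol=1e-6):
    """EM estimates of the mean vector and covariance matrix of (x, y)
    when x is fully observed and y has missing entries coded np.nan."""
    miss = np.isnan(y)
    # Step 0: starting values from listwise deletion
    mu = np.array([x.mean(), y[~miss].mean()])
    cov = np.cov(x[~miss], y[~miss])
    for _ in range(max_iter):
        # E-step: regression of y on x implied by the current parameters
        beta = cov[0, 1] / cov[0, 0]
        alpha = mu[1] - beta * mu[0]
        resid_var = cov[1, 1] - beta ** 2 * cov[0, 0]
        y_exp = np.where(miss, alpha + beta * x, y)              # E[y | x]
        y2_exp = np.where(miss, y_exp ** 2 + resid_var, y ** 2)  # E[y^2 | x]
        # M-step: recompute sufficient statistics from the completed data.
        # Adding resid_var above is the correction that plain single
        # imputation omits, so variances are not understated.
        mu_new = np.array([x.mean(), y_exp.mean()])
        var_x = x.var()
        cov_xy = (x * y_exp).mean() - mu_new[0] * mu_new[1]
        var_y = y2_exp.mean() - mu_new[1] ** 2
        cov_new = np.array([[var_x, cov_xy], [cov_xy, var_y]])
        if np.abs(mu_new - mu).max() < tol and np.abs(cov_new - cov).max() < tol:
            return mu_new, cov_new
        mu, cov = mu_new, cov_new
    return mu, cov
```

The real four-variable example from Allison works the same way, just with multivariate regressions of X3 and X4 on X1 and X2 in the E-step.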

39

Modern Methods: EM Algorithm (4)

EM Advantages:
Only needs to assume incomplete data arise from an MAR process, not MCAR
Fast (relative to MCMC-based multiple imputation approaches)
Applicable to a wide range of data analysis scenarios
Uses all available data to estimate sufficient statistics
Fairly robust to non-joint MVN data
Provides a single, deterministic set of results
May be all that is needed for non-inferential analyses (e.g., Cronbach's alpha or exploratory factor analysis)
Lots of software (commercial and freeware)

40

Modern Methods: EM Algorithm (5)

EM Disadvantage:
Produces correct parameter estimates, but standard errors for inferential analyses will be biased downward because analyses of EM-generated data assume all data arise from a complete data set without missing information. The analyses of the EM-based data do not properly account for the uncertainty inherent in imputing missing data.

Recent work by Meng provides a method by which appropriate standard errors may be generated for EM-based parameter estimates.

Bootstrapping may also be used to overcome this limitation.

41

Modern Methods: Multiple Imputation (1)

What is unique about MI:
We impute multiple data sets to analyze, not a single data set as in single imputation approaches
Use the EM algorithm to obtain starting values for MI
The differences between the imputed data sets capture the uncertainty due to imputing values
The actual values in the imputed data sets are less important than analysis results combined across all data sets

Several MI advantages:
MI yields consistent, asymptotically efficient, and asymptotically normal estimators under MAR (same as direct ML)
MI-generated data sets may be used with any kind of software or model

42

Modern Methods: Multiple Imputation (2)

The MI point estimate is the mean of the estimates across the m imputed data sets:

Q̄ = (1/m) Σᵢ₌₁ᵐ Qᵢ

The MI variance estimate is the sum of Within and Between imputation variation:

T = W + (1 + 1/m) B

where

W = (1/m) Σᵢ₌₁ᵐ Vᵢ (within-imputation variance)

B = (1/(m − 1)) Σᵢ₌₁ᵐ (Qᵢ − Q̄)² (between-imputation variance)

(Qᵢ and Vᵢ are the parameter estimate and its variance in the ith imputed dataset)
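A minimal sketch of these combining rules (Rubin, 1987) in code; the function and argument names are ours:

```python
import math

def combine_mi(estimates, variances):
    """Rubin's rules: combine m parameter estimates Q_i and their
    variances V_i (squared standard errors) into one estimate and SE."""
    m = len(estimates)
    qbar = sum(estimates) / m                              # MI point estimate
    w = sum(variances) / m                                 # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = w + (1 + 1 / m) * b                            # total variance T
    return qbar, math.sqrt(total)

# Example: one regression coefficient from five imputed analyses,
# each with standard error 0.2 (so squared SE = 0.04)
est, se = combine_mi([1.2, 1.0, 1.1, 0.9, 1.3], [0.04] * 5)
print(est, se)   # ≈ 1.1 and sqrt(0.04 + 1.2 * 0.025) ≈ 0.2646
```

Note that the combined SE (0.2646) exceeds any single analysis's SE (0.2): the between-imputation term is exactly where MI accounts for the uncertainty due to the missing values.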

43

Modern Methods: Multiple Imputation (3)

Imputation model vs. analysis model:
Imputation model should include any auxiliary variables (i.e., variables that are correlated with other variables that have incomplete data; variables that predict data missingness)
Analysis model should contain a subset of the variables from the imputation model and address issues of categorical data, non-normal data

Texts that discuss MI in detail:
Little & Rubin (2002, John Wiley and Sons): a seminal classic
Rubin (1987, John Wiley and Sons): non-response in surveys
J. L. Schafer (1997, Chapman & Hall): modern and updated
P. Allison (2001, Sage Publications series #136): a readable and practical overview of and introduction to MI and missing data handling approaches

44

Modern Methods: Multiple Imputation (4)

Multivariate normal imputation approach:
MI approaches exist for multivariate normal data, categorical data, mixed categorical and normal variables, and longitudinal/clustered/panel data.
The MV normal approach is most popular because it performs well in most applications, even with somewhat non-normal input variables (Schafer, 1997).
Variable transformations can further improve imputations.
For each variable with missing data, estimate the linear regression of that variable on all other variables in the data set.
Using a Bayesian prior distribution for the parameters, typically noninformative, regression parameters are drawn from the posterior Bayesian distribution. Estimated regression equations are used to generate predicted values for missing data points.

45

Modern Methods: Multiple Imputation (5)

Multivariate normal imputation approach (continued):
Add to each predicted value a random draw from the residual normal distribution to reflect uncertainty due to incomplete data.
Obtaining Bayesian posterior random draws is the most complex part of the procedure. Two approaches:

Data augmentation - implemented in NORM and PROC MI
Uses a Markov-Chain Monte Carlo (MCMC) approach to generate the imputed values

A variant of data augmentation - implemented in ice (and MICE)
Uses a Gibbs sampler and switching regressions approach (Fully Conditional Specification - FCS) to generate the imputed values (van Buuren)

Sampling Importance/Resampling (SIR) - implemented in Amelia and a user-written macro in SAS (sirnorm.sas); claimed to be faster than data augmentation-based approaches.

"The relative superiority of these methods is far from settled" (Allison, 2001, p. 34)
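The "predicted value plus a residual draw" step can be sketched as follows. This illustrates stochastic regression imputation only; full data augmentation also draws the regression parameters themselves from their posterior, which is omitted here:

```python
import numpy as np

def stochastic_regression_impute(x, y, rng):
    """Fill missing y values with the regression prediction from x
    plus a random draw from the estimated residual distribution."""
    miss = np.isnan(y)
    b1, b0 = np.polyfit(x[~miss], y[~miss], 1)   # OLS fit on observed cases
    resid = y[~miss] - (b0 + b1 * x[~miss])
    sigma = resid.std(ddof=2)                    # residual standard deviation
    y_out = y.copy()
    # Prediction alone would understate variability; the residual draw
    # restores the scatter around the regression line
    y_out[miss] = b0 + b1 * x[miss] + rng.normal(0, sigma, miss.sum())
    return y_out
```

Running this once per imputed data set (with fresh random draws, and in full MI with fresh parameter draws too) is what makes the m data sets differ from each other.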

46

Modern Methods: Multiple Imputation (6)

Steps in using MI:

Select variables for the imputation model - use all variables in the analysis model, including any dependent variable(s), and any variables that are associated with variables that have missing data or the probability of those variables having missing data (auxiliary variables), in part or in whole.

Transform non-normal continuous variables to attain normality (e.g., skewed variables).

Select a random number seed for imputations (if possible).

Choose number of imputations to generate:
Typically 5 to 10: > 90% coverage & efficiency with 90% or less missing information in large sample scenarios with M = 5 imputations (Rubin, 1987).
Sometimes, however, you may need more imputations (e.g., 20 or more for some longitudinal scenarios).
You can compute the relative efficiency of parameter estimates as: relative efficiency = (1 / (1 + rate of missing information / number of imputations)) x 100. Several MI software programs output the missing information rates for parameters, allowing the analyst to easily compute relative efficiencies.
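The relative-efficiency formula above is simple to compute directly (a sketch; gamma denotes the rate of missing information):

```python
def relative_efficiency(gamma, m):
    """Relative efficiency of an estimate based on m imputations,
    given a rate of missing information gamma (Rubin, 1987)."""
    return 1.0 / (1.0 + gamma / m)

# With 30% missing information, M = 5 imputations are already ~94% efficient
print(round(100 * relative_efficiency(0.3, 5), 1))   # 94.3
```

This is why a handful of imputations usually suffices: efficiency gains beyond M = 5 or 10 are small unless the missing-information rate is very high.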

47

Modern Methods: Multiple Imputation (7)

Steps in using MI (continued):

Produce the multiply imputed data sets:
Estimated parameters must be independent of initial values
Assess independence via autocorrelation and time series plots (when using MCMC-based MI programs)

Back-transform any previously transformed variables and round imputations for discrete variables.

Analyze each imputed data set using standard statistical approaches. If you generated M imputations (e.g., 5), you would perform M separate, but identical analyses (e.g., 5).

Combine results from the M multiply imputed analyses (using NORM, SAS PROC MIANALYZE, or Stata miest or micombine) using Rubin's (1987) formulas to obtain a single set of parameter estimates and standard errors. Both p-values and confidence intervals may be generated.

48

Modern Methods: Multiple Imputation (8)

Steps in using MI (continued):

Rules for combining parameter estimates and standard errors:
A parameter estimate is the mean of the parameter estimates from the multiple analyses you performed.
The standard error is computed as follows:
Square the standard errors from the individual analyses and average them across the M imputations (within-imputation variance).
Calculate the variance of the parameter estimates across the M imputations (between-imputation variance).
Add the results of the previous two steps together, applying a small correction factor (1 + 1/M) to the between-imputation variance, and take the square root.

There is a separate F-statistic available for multiparameter inference (i.e., multi-DF tests of several parameters at once).

It is also possible to combine chi-square tests from the analysis of multiply imputed data sets.

49

Modern Methods: Multiple Imputation (9)

Is it wrong to impute the DV?

Yes, if performing single, deterministic imputation (methods historically used by econometricians).

No, if using the random draw approach of Rubin. In fact, leaving out the DV will cause bias (it will bias the coefficients towards zero).

Given that the goal of MI is to reproduce all the relationships in the data as closely as possible, this can only be accomplished if all the dependent variable(s) are included in the imputation process.

50

Modern Methods: Multiple Imputation (10)

Available imputation software for data augmentation:

SAS: PROC MI and PROC MIANALYZE (demonstrated)
MI produces imputations
MIANALYZE combines results from analyses of imputed data into a single set of hypothesis tests

NORM - for MV normal data (J. L. Schafer)
Windows freeware
S-Plus MISSING library
R (add-in file)

CAT, MIX, and PAN - for categorical data, mixed categorical/normal data, and longitudinal or clustered panel data respectively (J. L. Schafer)
S-Plus MISSING library
R (add-in file)

LISREL - http://www.ssicentral.com (Windows, commercial)

51

Modern Methods: Multiple Imputation (11)

Newly Available MI Software from Stata:
(Uses Gibbs sampler and switching regressions; related to data augmentation)

Can handle continuous, dichotomous, categorical and ordinal data

Can handle interactions

Stata: -ice- with -micombine-
http://www.stata.com/search.cgi?query=ice
http://www.ats.ucla.edu/stat/stata/library/ice.htm
From inside Stata: . findit multiple imputation

52

Modern Methods: Multiple Imputation (12)

Available Imputation Software for Sampling Importance/Resampling (SIR):

AMELIA
Windows freeware version (NOT demonstrated)
Produces the multiply imputed MI data sets.
http://pantheon.yale.edu/~ks298/index_files/software.htm
http://gking.harvard.edu/amelia/
More complete Gauss version available: http://www.aptech.com/

STATA can be used on datasets from AMELIA (NOT demonstrated)
MIEST - a user-written command to run and combine separate analyses into a single model. http://gking.harvard.edu/amelia/amelia1/docs/mi.zip
MIEST2 - modifies MIEST to output non-integer DF for hypothesis tests

SIRNORM.SAS - SAS user-written macro
http://yates.coph.usf.edu/research/psmg/Sirnorm/sirnorm.html

53

Multiple Imputation Example (1) [Same as ML Example]

Data on American colleges and universities from US News and World Report

N = 1302 colleges

Available from http://lib.stat.cmu.edu/datasets/colleges

Described on p. 21 of Allison (2001)

54

Multiple Imputation Example (2)

Outcome: gradrat - graduation rate (1,204 non-missing cases)

Predictors:
csat - combined average scores on verbal and math SAT (779 non-missing cases)
lenroll - natural log of the number of enrolling freshmen (1,297 non-missing cases)
private - 1 = private; 0 = public (1,302 non-missing cases)
stufac - ratio of students to faculty (x 100; 1,300 non-missing cases)
rmbrd - total annual cost of room and board (thousands of dollars; 1,300 non-missing cases)

Auxiliary Variable:
act - mean ACT scores (714 non-missing cases)

55

MI SAS Example (1)

Using SAS to perform multiple imputation:

Suggest running PROC UNIVARIATE or PROC FREQ prior to running PROC MI in order to examine distributions of variables, identify ranges, and integer precision of each variable.

Some variables will have predefined ranges that can be specified in PROC MI. E.g., CSAT ranges 400 to 1600.

Ranges for other variables can be set to their empirical values.

SAS creates a single SAS data set containing the individual imputed data sets stacked. Each imputed data set is denoted by the value of the SAS variable _IMPUTATION_. You can run substantive analyses on the imputed data sets by using a SAS BY statement (e.g., BY _IMPUTATION_ ;).

56

MI SAS Example (2)

PROC MI syntax for the college graduation data set example:

PROC MI DATA = paul.usnews

OUT = miout

NIMPUTE = 10

SEED = 12345678

MINIMUM = 400 11 . 0 . 0 0

MAXIMUM = 1600 31 100 100 . 1 .

ROUND = 1 1 . 1 .001 1 . ;

MCMC CHAIN = MULTIPLE

NBITER = 500 NITER = 250

TIMEPLOT (MEAN(csat rmbrd) COV (gradrat*rmbrd) WLF)

ACFPLOT (MEAN(csat rmbrd) COV(gradrat*rmbrd) WLF) ;

TITLE "Multiple Imputation procedure run on US News college data set" ;

VAR csat act stufac gradrat rmbrd private lenroll ;

RUN ;

57

MI SAS Example (3)

PROC MI Statement:

PROC MI DATA = paul.usnews

OUT = miout

NIMPUTE = 10

SEED = 12345678

MINIMUM = 400 11 . 0 . 0 0

MAXIMUM = 1600 31 100 100 . 1 .

ROUND = 1 1 . 1 .001 1 . ;

NIMPUTE: the number of imputations (default = 5)

SEED: use the same random number seed to replicate imputations over multiple program runs

MINIMUM, MAXIMUM, and ROUND:
Specify minimum values, maximum values, and values to which imputations are rounded. Useful for handling categorical and integer variables. Dots/periods represent no values specified. The first variable cannot have a period placeholder.
Order of values corresponds to variables listed in the VAR statement (i.e., csat act stufac gradrat rmbrd private lenroll).
csat, stufac, and gradrat ranges set on basis of meaningful expectations; others are set via empirical frequency data.

58

MI SAS Example (4)

MCMC Statement:

MCMC CHAIN = MULTIPLE

NBITER = 500 NITER = 250

TIMEPLOT (MEAN(csat rmbrd) COV (gradrat*rmbrd) WLF)

ACFPLOT (MEAN(csat rmbrd) COV(gradrat*rmbrd) WLF) ;

CHAIN - selects single or multiple chain Markov-Chain Monte Carlo data augmentation procedure. Multiple chain may be slightly preferred (Allison, 2001, p. 38).

NBITER - number of "burn in" iterations performed prior to imputed data sets being created. Often set to twice the number of iterations EM requires to converge (Schafer).

NITER - number of iterations between creation of each imputed data set. More iterations ensure independence between imputed data sets. You can diagnose non-independence with time series and autocorrelation plots.

59

MI SAS Example (5)

MCMC Statement (continued):

TIMEPLOT - produces time series plot for the worst linear function of variables containing the most missing data (csat and rmbrd)

ACFPLOT - produces autocorrelation plot for the worst linear function of variables containing the most missing data

TRANSFORM statement also available for variable transformations:
Example: TRANSFORM LOG(rmbrd/c=5)
C option adds a constant prior to transformation
Available transformations: Box-Cox, Exp, Logit, Log, Power

60

MI SAS Example (6) Time Series Plot

[Time series plot of the worst linear function (y-axis roughly 945 to 970) across iterations -500 to 0]

61

MI SAS Example (7) Autocorrelation Plot

[Autocorrelation plot (y-axis -1.0 to 1.0) across lags 0 to 20]

62

MI SAS Example (8)

ML linear regression analysis, via PROC GENMOD, of the stacked data set output by PROC MI

PROC GENMOD DATA = miout ;

TITLE "Illustration of GENMOD analysis of the college data set" ;

MODEL gradrat = csat lenroll stufac private rmbrd / COVB ;

BY _IMPUTATION_ ;

ODS OUTPUT PARAMETERESTIMATES=gmparms COVB=gmcovb ;

RUN ;

BY statement repeats the analysis for each imputed data set

COVB option on the MODEL statement displays the variance-covariance matrix of the parameter estimates

ODS OUTPUT statement outputs the parameter estimates and their variance-covariance matrix to separate SAS data sets, gmparms and gmcovb, respectively. These data sets are then combined by PROC MIANALYZE to return a single set of results to the analyst.

63

MI SAS Example (9)

Combining GENMOD results with PROC MIANALYZE: Single Parameter Inference

PROC MIANALYZE PARMS = gmparms COVB = gmcovb ;

TITLE "Single DF inferences of GENMOD analysis of US News college data set" ;

VAR intercept csat lenroll stufac private rmbrd ;

RUN ;

PARMS= reads the parameter estimates; COVB= reads the variance-covariance matrix of the parameter estimates

Note the presence of the INTERCEPT term on the VAR statement - you will need to include it to obtain INTERCEPT results
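For a single parameter, PROC MIANALYZE pools the m completed-data results using Rubin's rules: the pooled estimate is the mean of the estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. A minimal sketch of that computation (Python; rubin_combine is a hypothetical helper, not a SAS procedure):

```python
import math

def rubin_combine(estimates, variances):
    """Rubin's rules for one parameter: pool m completed-data estimates
    and their squared standard errors into one estimate and one SE."""
    m = len(estimates)
    qbar = sum(estimates) / m                               # pooled estimate
    ubar = sum(variances) / m                               # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    t = ubar + (1 + 1 / m) * b                              # total variance
    return qbar, math.sqrt(t)

# Three imputations' estimates with SE = 0.5 each:
est, se = rubin_combine([2.0, 2.4, 2.2], [0.25, 0.25, 0.25])
```

Because the between-imputation term inflates the total variance, the pooled SE is larger than any single completed-data SE whenever the imputations disagree, which is the honest accounting MI is designed to provide.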

64

MI SAS Example (10)

Combining GENMOD results with PROC MIANALYZE: Multiparameter Inference

PROC MIANALYZE MULT PARMS = gmparms COVB = gmcovb ;

TITLE "Multivariate inference of GENMOD analysis of US News college data set" ;

VAR csat lenroll stufac private rmbrd ;

RUN ;

MULT option performs multivariate hypothesis testing

Note absence of intercept in the VAR statement - we do not want it included as part of the list of variables tested

65

MI SAS Example (11)

Inference using other SAS procedures

REG, LOGISTIC, PROBIT, LIFEREG, and PHREG: use the OUTEST= and COVOUT options

MIXED, GLM, and CALIS: use ODS

MIXED: request SOLUTION and COVB as MODEL statement options, then ODS OUTPUT SOLUTIONF = gmparms COVB = gmcovb ;

GENMOD for GEE: use ODS as shown in this example, substituting the GEEempest and GEERCov ODS tables for the parameter estimate and covariance matrix tables shown above.

66

MI Stata Example (1)

Using Stata to check the original data

* Read in original data and save as *.dta
. insheet using usnewsN.txt, names delimit(" ") clear
. save usnews.dta, replace

* Obtain (available cases, single) estimates of means and variances
. summarize gradrat csat lenroll stufac private rmbrd act

* Obtain (available cases, pairwise) estimates of correlations
. pwcorr gradrat csat lenroll stufac private rmbrd act, obs

* Obtain (complete cases) estimates of correlations, means, and variances
. corr gradrat csat lenroll stufac private rmbrd, obs

* Obtain (complete cases) estimates of regression coefficients
. regress gradrat csat lenroll stufac private rmbrd

* Examine patterns of missingness
. mvpatterns gradrat csat lenroll stufac private rmbrd
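What -mvpatterns- tabulates - the distinct observed/missing patterns across a variable list and their frequencies - can be sketched in a few lines (Python; missing_patterns is a hypothetical helper, with None standing in for a missing value):

```python
from collections import Counter

def missing_patterns(rows, variables):
    """Tabulate distinct missing-data patterns across a variable list,
    in the spirit of Stata's -mvpatterns-: '+' = observed, '.' = missing."""
    patterns = Counter()
    for row in rows:
        pattern = ''.join('.' if row.get(v) is None else '+' for v in variables)
        patterns[pattern] += 1
    return dict(patterns)

# Toy rows mimicking the college data's variables:
data = [{'gradrat': 58, 'csat': 970},
        {'gradrat': 72, 'csat': None},
        {'gradrat': None, 'csat': None}]
print(missing_patterns(data, ['gradrat', 'csat']))  # -> {'++': 1, '+.': 1, '..': 1}
```

Inspecting these patterns before imputing tells you whether missingness is concentrated in a few variables (item-missing) or follows a monotone drop-out structure.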

67

MI Stata Example (2)

Using Stata to create the multiply imputed datasets (stacked together in a single dataset)

. use usnews, clear
. mvis csat act stufac gradrat rmbrd private lenroll using usnews_mvis10, m(10) genmiss(m_) seed(12345678)

OR (better):

. ice csat act stufac gradrat rmbrd private lenroll using usnews_ice10, m(10) seed(12345678)

Using Stata to analyze the multiply imputed datasets and combine the results

* -micombine- to obtain MI estimates of regression coefficients
. use usnews_ice10, clear
. micombine regress gradrat csat lenroll stufac private rmbrd
. testparm csat lenroll stufac private rmbrd

68

Multiple Imputation Summary

Multiple imputation is flexible: imputed datasets can be analyzed using parametric and non-parametric techniques

MI is available in SAS and in the S-PLUS MISSING library; also free via NORM and AMELIA, and in R. Some SAS procedures are easier to use with MI than others. SAS and NORM permit user-specified random number seeds, and both permit testing multiparameter hypotheses.

Multiple imputation using Stata: you can use the Stata command ice to generate multiply imputed data sets and the command micombine to combine the results from analyses of imputed data sets. ice allows imputation of unordered or ordered categorical variables as well as continuous, normally distributed variables; it also handles interactions properly.

Alternatively, you can use AMELIA to generate multiply imputed data sets and feed them into Stata for analyses; miest / miest2 can then combine the analysis results.

All Stata estimation commands are equally easy to use with micombine and miest(2).

ice permits user-specified random number seeds; micombine permits testing multiparameter hypotheses.

Multiple imputation is non-deterministic: you get a different result each time you generate imputed data sets (unless the same random number seed is used each time)

It is easy to include auxiliary variables into the imputation model to improve the quality of imputations

Compared with direct ML, MI may handle large numbers of variables more easily.

69

Comparison of Regression Example Results

Predictor   Listwise w/SAS    Mplus Direct ML   Mplus Robust ML   SAS MI w/PROC     Stata ice w/
            PROC GENMOD                                           GENMOD            micombine

CSAT        .067 (.006)       .067 (.005)       .067 (.005)       .067 (.005)       .066 (.005)
LEnroll     2.417 (.953)      2.083 (.595)      2.083 (.676)      2.185 (.575)      2.129 (.598)
StuFac      -.123 (.131)      -.181 (.092)      -.181 (.097)      -.184 (.097)      -.189 (.101)
            p = .348          p = .049          p = .051          p = .061          p = .066
Private     13.588 (1.933)    12.914 (1.276)    12.914 (1.327)    13.034 (1.270)    12.900 (1.374)
RmBrd       2.162 (.709)      2.404 (.548)      2.404 (.570)      2.468 (.491)      2.527 (.518)

Entries are coefficient (standard error). Listwise N = 455; N = 1302 for all other analyses.

70

Extensions

Multiple imputation under non-linearity and interaction - possible but more complex than linear main effects only

Multiple imputation for panel (longitudinal or clustered) data - only available off the shelf in S-PLUS (you can sometimes transform “long” clustered data structure to a “wide” format in which multiple time points are expressed as multiple variables, perform MI, and retransform the imputed data sets into “long” form).

Weighting-based approaches to handle missing data - a promising approach

Non-ignorable situations - rely on a priori knowledge of the missingness mechanism:

Pattern-mixture models

Selection models (e.g., Heckman's model)

71

Conclusions

Planning ahead can minimize missing cross-sectional responses and longitudinal loss to follow-up

Use of ad hoc methods can lead to biased results

Modern methods are readily available for MAR data

FIML/Direct ML most convenient for models that are supported by available software and when parametric assumptions are met

Multiple Imputation available and effective for remaining situations

Imputation strategies for clustered data and non-linear analyses available, but more complicated to implement

Non-ignorable models are available, but still more complicated and rest on tenuous assumptions