DEPARTMENT OF MATHEMATICS AND STATISTICS Handling Missing Data Tao Sun Lena Zhang Yaqing Chen Francisco Aguirre SSC Case Study 2002

DEPARTMENT OF MATHEMATICS AND STATISTICS

Handling Missing Data

Tao Sun

Lena Zhang

Yaqing Chen

Francisco Aguirre

SSC Case Study 2002

SSC Conference, Hamilton, Ontario, May 2002


Presentation Outline

1. Preliminary analysis
• Various plots

2. Assessing the missing pattern
• Spearman rank correlation, logistic regression

3. Data analysis with missing data - Multiple Imputation
• Random hot deck imputation with bootstrap
• PROC MI and MIANALYZE (SAS)
• Transcan function (Hmisc library in S-Plus or R)

4. Conclusions

5. Further work

Objective:

Compare different approaches to handle missing data from a practitioner’s point of view



Preliminary analysis

RESPONSE OVERVIEW

Sample size: 2389

Males: 1097 (45.9%)

Females: 1292 (54.1%)

Observed: 1691

Missing: 698 (29.2%)

Mean: 0.9129

• The response variable is highly skewed to the left.

Figure: Histogram of observed responses (horizontal axis: DVHST94, from 0.4 to 1.0).




• 8 covariates in total, first 4 shown here.

• There appears to be a pattern of two clusters in the response DVHST94 (below 0.5 and above 0.5).

• DVBMI94 appears to have some “wild” values (= 96)
– 43 observations, all males (3.9% of the male sample)
– Wild values were replaced with the mean DVBMI94 of males
– DVBMI94 transformation: NEW.DVBMI94 = abs(DVBMI94 - 22)
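The cleaning steps above can be sketched in a few lines of Python. This is only an illustration on made-up records: the function name `clean_bmi`, the `(sex, bmi)` record layout, and the choice to compute the male mean after excluding the wild values are assumptions, not details given in the slides.

```python
from statistics import mean

WILD_BMI = 96        # value flagged as "wild" in the slides
REFERENCE_BMI = 22   # centring constant from the transformation above


def clean_bmi(records):
    """Return NEW.DVBMI94 for a list of (sex, bmi) pairs.

    Assumption: the male mean is computed from the non-wild male values.
    """
    valid_male = [bmi for sex, bmi in records if sex == "M" and bmi != WILD_BMI]
    male_mean = mean(valid_male)
    cleaned = []
    for _sex, bmi in records:
        if bmi == WILD_BMI:
            bmi = male_mean  # replace the wild value with the male mean
        cleaned.append(abs(bmi - REFERENCE_BMI))
    return cleaned


# Hypothetical records, not taken from the case-study data
sample = [("M", 24.0), ("M", 96), ("M", 28.0), ("F", 20.0)]
new_bmi = clean_bmi(sample)
```

The absolute value makes NEW.DVBMI94 a distance from the reference value of 22, so low and high BMI are treated symmetrically.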




• There are no obvious linear patterns between the covariates and the response DVHST94

• DVPP94 is recoded as dichotomous:
– NEW.DVPP94 = 0 (91% of observations)
– NEW.DVPP94 > 0 (9% of observations)

• The AGEGRP covariate is recoded to NEW.AGE:
– NEW.AGE = midrange value (AGEGRP) - 20
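A minimal sketch of the two recodings, assuming five-year age groups such as 20 to 24 (the group bounds are hypothetical; they are chosen only because they reproduce NEW.AGE values like 2, 7 and 42 that appear elsewhere in the slides). The function names are illustrative.

```python
def recode_dvpp94(dvpp94):
    """Dichotomise DVPP94 into a zero / positive label."""
    return "DVPP94 = 0" if dvpp94 == 0 else "DVPP94 > 0"


def recode_age(lower, upper):
    """NEW.AGE = midrange of the age group minus 20."""
    return (lower + upper) / 2 - 20


# Hypothetical age-group bounds (assumption, not from the source)
age_groups = [(20, 24), (25, 29), (60, 64)]
new_age = [recode_age(lo, hi) for lo, hi in age_groups]
dvpp_labels = [recode_dvpp94(score) for score in (0, 0.35, 0)]
```

Subtracting 20 moves the origin of the age scale to the start of the youngest group, which makes the regression intercept easier to interpret.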




Figure: Mean DVHST94 by covariate category (horizontal axis: mean DVHST94, from 0.84 to 0.92). N = 1691, N missing = 698. Number of observed responses (N) in each category:

NEW.AGE: 2 (309), 7 (283), 12 (383), 17 (296), 22 (173), 27 (132), 32 (61), 37 (29), 42 (25)
SEX: Female (857), Male (834)
DVHHIN94: [1,7) (635), 7 (259), [8,10) (435), [10,11] (362)
DVSMKT94: 1 (535), 2 (57), 3 (44), 4 (305), 5 (155), 6 (595)
NEW.DVPP94: DVPP94 > 0 (172), DVPP94 = 0 (1519)
NUMCHRON: 0 (814), 1 (491), [2,9] (386)
VISITS: [0,3) (487), [3,6) (433), [6,12) (382), [12,94] (389)
NEW.WT6: [0.0547,0.447) (424), [0.4473,0.824) (422), [0.8239,1.430) (420), [1.4297,7.445] (425)
NEW.DVBMI94: [0.0,1.6) (453), [1.6,3.1) (383), [3.1,6.1) (433), [6.1,18.0] (422)
Overall: 1691




• Strength of marginal relationships between the covariates and the response using generalized Spearman chi-square
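As a rough illustration of the rank-based idea, the sketch below computes the ordinary Spearman rank correlation with average ranks for ties. It is not the generalized Spearman chi-square used in the slides, and the function names are illustrative.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical values: a monotone but non-linear relationship
rho = spearman_rho([1, 2, 3, 4], [1, 8, 27, 64])
```

Because only ranks enter the calculation, the measure is insensitive to the strong skewness of the response noted earlier.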



Assessing the missing pattern

• The missing pattern of the response does not appear to depend on the sampling weights



• The missing values depend on age


Figure: Missing response DVHST94 vs NEW.AGE (horizontal axis: NEW.AGE = 2, 7, ..., 42; left axis: sample size, 0 to 500; right axis: percentage of missing, 0% to 100%; series: total sample size and % missing values).




LOGISTIC REGRESSION

Coefficients:
                  Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)      -5.058793     0.367083   -13.781    < 2e-16   ***
NEW.AGE           0.181625     0.007524    24.140    < 2e-16   ***
SEXMale          -0.847947     0.131475    -6.450   1.12e-10   ***
DVHHIN94          0.047828     0.026768     1.787     0.0740   .
DVSMKT94         -0.015131     0.031662    -0.478     0.6327
NEW.DVPP94 = 0    0.233188     0.226732     1.028     0.3037
NUMCHRON         -0.087992     0.048783    -1.804     0.0713   .
VISITS            0.012483     0.006563     1.902     0.0572   .
NEW.WT6          -0.043935     0.077407    -0.568     0.5703
NEW.DVBMI94      -0.015622     0.017299    -0.903     0.3665
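To read the two strongly significant coefficients on the odds scale, they can be exponentiated. The sketch below does this for NEW.AGE and SEXMale with approximate 95% Wald intervals, assuming the modelled outcome is the indicator that DVHST94 is missing (the slides do not state the coding explicitly).

```python
import math

# (estimate, standard error) copied from the logistic regression output
coefficients = {
    "NEW.AGE": (0.181625, 0.007524),
    "SEXMale": (-0.847947, 0.131475),
}

Z_95 = 1.96  # normal quantile for an approximate 95% Wald interval

odds_ratios = {}
for name, (beta, se) in coefficients.items():
    odds_ratios[name] = (
        math.exp(beta),              # odds ratio
        math.exp(beta - Z_95 * se),  # lower limit
        math.exp(beta + Z_95 * se),  # upper limit
    )

for name, (ratio, low, high) in odds_ratios.items():
    print(f"{name}: OR = {ratio:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Under that assumption, an odds ratio of roughly 1.2 for NEW.AGE means the odds of a missing response rise by about 20% per unit of NEW.AGE, while the ratio below 1 for SEXMale means males have lower odds of a missing response than females.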

Figure: Missing response DVHST94 vs Gender (bars of missing and observed counts for Male, Female and Total; vertical axis: 0 to 3000).

% missing for males: 24%

% missing for females: 34%



Multiple imputation

Methods:

– Random Hot Deck MI with Bootstrap

– SAS PROC MI and PROC MIANALYZE

– Function TRANSCAN in S-Plus from the Hmisc library (Frank Harrell)



Multiple Imputation

INCOMPLETE DATA → (IMPUTATION) → IMPUTED DATA → (ANALYSIS) → ANALYSIS RESULTS → (POOLING) → FINAL RESULTS

• IMPUTATION:

Impute the missing entries of the incomplete data set B times, resulting in B completed data sets.

• ANALYSIS:

Analyze each of the B completed data sets using weighted least squares.

• POOLING:

Integrate the B analysis results into a final result. Simple rules exist for combining the B analyses.
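The pooling step can be sketched for a single regression coefficient using the standard combining rules (Rubin's rules): average the B estimates, average the B within-imputation variances, and add an inflated between-imputation variance. The function name `pool` and the numbers are illustrative, not results from the case study.

```python
from statistics import mean, variance


def pool(estimates, within_variances):
    """Combine B scalar estimates and their variances.

    Returns (pooled estimate, total variance).
    """
    b = len(estimates)
    pooled = mean(estimates)          # average of the B estimates
    within = mean(within_variances)   # average within-imputation variance
    between = variance(estimates)     # sample variance, divisor B - 1
    total = within + (1 + 1 / b) * between
    return pooled, total


# Hypothetical estimates of one coefficient from B = 3 completed data sets
estimates = [0.9, 1.0, 1.1]
within_variances = [0.04, 0.04, 0.04]
pooled_estimate, total_variance = pool(estimates, within_variances)
```

The factor (1 + 1/B) accounts for using a finite number of imputations; it approaches 1 as B grows.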



Random hot-deck MI with Bootstrap

B = 1000 replicates

• Each missing response is filled with an observed response chosen randomly with replacement (probability ~ weights), giving a complete data set.
• Replicate 1: estimated (\hat{Q}_1, U_1, R_1), that is (estimate, within variance, R-square).
• Same procedure for each replicate, up to replicate 1000: estimated (\hat{Q}_{1000}, U_{1000}, R_{1000}).

\[
\tilde{R} = \operatorname{mean}(R) = \frac{1}{B} \sum_{i=1}^{B} R_i
\]

Compute 95% CI for judging significance of predictors:

\[
\text{Total variance} = \bar{U}_B + \left(1 + B^{-1}\right) B_B
\]

where

\[
\bar{U}_B = \frac{1}{B} \sum_{i=1}^{B} U_i \quad \text{(within variance)}
\]

\[
B_B = \frac{1}{B - 1} \sum_{i=1}^{B} \left(\hat{Q}_i - \bar{Q}_B\right)^2 \quad \text{(between variance)}, \qquad \bar{Q}_B = \frac{1}{B} \sum_{i=1}^{B} \hat{Q}_i
\]
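The imputation step of the random hot deck can be sketched as follows. This is a minimal illustration that assumes the donor's sampling weight sets its selection probability and that missing responses are coded as `None`; the function name `hot_deck_impute` and the data are hypothetical.

```python
import random


def hot_deck_impute(responses, weights, rng):
    """Fill missing responses (None) with randomly drawn observed responses.

    Donors are drawn with replacement, with probability proportional to
    the donor's weight (assumption about how the weights enter).
    """
    donor_values = [y for y in responses if y is not None]
    donor_weights = [w for y, w in zip(responses, weights) if y is not None]
    n_missing = sum(y is None for y in responses)
    draws = iter(rng.choices(donor_values, weights=donor_weights, k=n_missing))
    return [next(draws) if y is None else y for y in responses]


rng = random.Random(2002)  # fixed seed so the sketch is reproducible
responses = [0.95, None, 0.80, None, 1.00]   # hypothetical DVHST94 values
weights = [1.2, 0.7, 0.9, 1.1, 1.0]          # hypothetical sampling weights
replicates = [hot_deck_impute(responses, weights, rng) for _ in range(1000)]
```

Each of the 1000 completed data sets would then be analyzed separately. Note that the draw ignores the covariates entirely, which is the weakness of this method raised in the conclusions.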



PROC MI & MIANALYZE Method

PROC MI
1. By default generates 5 imputation values for each missing value
2. Imputation method: MCMC (Markov Chain Monte Carlo)
– EM algorithm determines initial values
– MCMC repeatedly simulates the distribution of interest from which the imputed values are drawn
3. Assumption: Data follows multivariate normal distribution

PROC REG
Fits five weighted linear regression models to the five complete data sets obtained from PROC MI (using the BY _Imputation_ statement)

PROC MIANALYZE
Reads the parameter estimates and associated covariance matrix from the analysis performed on the multiple imputed data sets and derives valid statistics for the parameters



TRANSCAN (S-Plus, Hmisc; Frank Harrell)

Transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables.

It approximates the multiple imputation algorithm described by Rubin’s Bayesian bootstrap.

• Draws a sample of size r with replacement from the r non-missing residuals.

• Chooses a sample of size m from this sample of size r with replacement. m is the number of missing values.

• Generates imputed values with the linear imputation model and the bootstrapped residuals.

(Y_obs, X_obs) → LS → (\hat{e}_1, ..., \hat{e}_r) → Bootstrap → (\hat{e}'_1, ..., \hat{e}'_r) → Bootstrap → (\hat{e}^*_1, ..., \hat{e}^*_m)

This algorithm is repeated B times to obtain the multiple imputed data sets, which are analyzed using WLS with the function lm.

Advantages:
• Does not need normality assumption or symmetry of residuals.
• Does shrinkage to avoid overfitting.

Disadvantage:
• “Freezes” the imputation model before drawing the multiple imputations.
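The two-stage residual bootstrap described above can be sketched with a single covariate and an ordinary least-squares line. This is a simplified illustration only: it leaves out transcan's variable transformations and shrinkage, and the function names and data are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def abb_impute(x_obs, y_obs, x_mis, rng):
    """One set of imputations from the two-stage residual bootstrap."""
    intercept, slope = fit_line(x_obs, y_obs)
    residuals = [y - (intercept + slope * x) for x, y in zip(x_obs, y_obs)]
    r, m = len(residuals), len(x_mis)
    first = rng.choices(residuals, k=r)   # size r, with replacement
    second = rng.choices(first, k=m)      # size m from the first sample
    return [intercept + slope * x + e for x, e in zip(x_mis, second)]


rng = random.Random(1)
x_obs = [2, 7, 12, 17, 22]              # hypothetical covariate values
y_obs = [0.95, 0.93, 0.90, 0.91, 0.85]  # hypothetical observed responses
x_mis = [27, 32]                        # covariate values with missing response
imputed_sets = [abb_impute(x_obs, y_obs, x_mis, rng) for _ in range(5)]
```

The regression line is fitted once and reused for every draw, which mirrors the "frozen" imputation model listed as a disadvantage: only the residuals vary between imputations.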



Comparing imputation methods

Ranking:
1. TRANSCAN (Advantage: shrinkage correction to prevent overfitting)
2. PROC MI (Drawback: normality assumption)
3. Bootstrap random hot deck (does not use the information of the covariates)

                  S-plus TRANSCAN      SAS PROC MI          Bootstrap (random hot deck)   Available data only
(Intercept)        0.8495 (0.0135) *    0.9281 (0.01) *      0.8711 (0.0128) *             0.861 (0.012) *
NEW.AGE           -0.0039 (0.0004) *   -0.0016 (0.0004) *   -0.0006 (0.0002) *            -0.0013 (0.0003) *
SEX (Male=1)       0.0045 (0.0045)      0.0023 (0.0045)      0.0031 (0.0055)               0.0037 (0.0049)
DVHHIN94           0.0083 (0.0016) *    0.0061 (0.0012) *    0.0029 (0.0007) *             0.0051 (0.0011) *
NEW.DVBMI94       -0.0001 (0.0007)     -0.0005 (0.0008)     -0.0006 (0.0005)              -0.0007 (0.0007)
DVSMKT94           0.0012 (0.0014)      0.0009 (0.0013)      0.0019 (0.0008) *             0.0012 (0.0012)
NEW.DVPP94(=0)     0.0904 (0.0085) *    0.0717 (0.0092) *    0.0531 (0.0089) *             0.0686 (0.0081) *
NUMCHRON          -0.0174 (0.0022) *   -0.0123 (0.0023) *   -0.0079 (0.0013) *            -0.013 (0.0021) *
VISITS            -0.0026 (0.0003) *   -0.0023 (0.0003) *   -0.0017 (0.0002) *            -0.0023 (0.0003) *

Mean R-square      0.33                 0.193                0.093                         0.183



Significant variables

Figure: One panel per significant variable (Intercept, DVHHIN94, NEW.DVPP94(=0), NEW.AGE, VISITS, NUMCHRON), each comparing S-plus TRANSCAN, SAS PROC MI, Random Hot Deck (Bootstrap) and complete observations.

Conclusions about the missing pattern

• The missing values of the response variable DVHST94 are not missing completely at random (MCAR). The probability of a missing response depends primarily on the observed covariates age and sex, so the missing values are treated as missing at random (MAR).



Conclusions about multiple imputation

• The transcan function appeared to perform better than PROC MI for imputing and analyzing this data set, given its non-normality.

• Random hot deck MI with bootstrap gave significantly biased results. This approach does not take into account the information provided by the covariates and is therefore not appropriate for data that are MAR.



Conclusions about the data analysis

• The health status of the population tends to decrease with age.

• People with higher income tend to have better health than people with less income.

• People with lower health status demand more medical services (visits to a doctor).

• People who are prone to depression tend to have lower health status.

• Smoking does not appear to have a decisive influence on the health status.



Future work

• A GLM, specifically a multinomial logistic model, could be used to model the categorical response GQ.H1 and to impute its missing values.

• Interactions of the significant variables with the insignificant variables should be explored in order to further assess the concomitant effects (e.g. smoking and depression).



Acknowledgements:

Special thanks to Professors Peggy Ng and George Monette for their support.
