
Handling Missing Data

Tao Sun, Lena Zhang, Yaqing Chen, Francisco Aguirre

DEPARTMENT OF MATHEMATICS AND STATISTICS

SSC Case Study 2002



SSC Conference, Hamilton, Ontario, May 2002

Presentation Outline

1. Preliminary analysis
   • Various plots
2. Assessing the missing pattern
   • Spearman rank correlation, logistic regression
3. Data analysis with missing data - Multiple Imputation
   • Random hot deck imputation with bootstrap
   • PROC MI and MIANALYZE (SAS)
   • Transcan function (Hmisc library in S-Plus or R)
4. Conclusions
5. Further work

Objective:

Compare different approaches to handle missing data from a practitioner’s point of view

Preliminary analysis

RESPONSE OVERVIEW

Sample size: 2389

Males: 1097 (45.9%)

Females: 1292 (54.1%)

Observed: 1691

Missing: 698 (29.2%)

Mean: 0.9129

• The response variable is highly skewed to the left.

[Figure: Histogram of observed responses; x-axis: DVHST94, from 0.4 to 1.0]

Preliminary analysis

• 8 covariates in total, first 4 shown here.

• There appears to be a pattern of two clusters in the response DVHST94 (below 0.5 and above 0.5).

• DVBMI94 appears to have some “wild” values (= 96)
   – 43 observations, all males (3.9% of the male sample)
   – Wild values were replaced with the mean DVBMI94 of males
   – DVBMI94 transformation: NEW.DVBMI94 = abs(DVBMI94 - 22)
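The DVBMI94 cleaning steps can be sketched in Python as follows. This is only an illustration under assumptions: the slide does not say whether the male mean was computed with or without the wild values (they are excluded here), and the function and argument names are hypothetical.

```python
def clean_bmi(bmi, is_male, wild_value=96.0, centre=22.0):
    """Replace wild BMI codes by the male mean, then return abs(BMI - centre)."""
    # Assumption: the male mean is taken over males with a non-wild BMI.
    valid_male = [b for b, m in zip(bmi, is_male) if m and b != wild_value]
    male_mean = sum(valid_male) / len(valid_male)
    # On the slide all wild values belong to males, so one replacement value suffices.
    replaced = [male_mean if b == wild_value else b for b in bmi]
    # NEW.DVBMI94 = abs(DVBMI94 - 22)
    return [abs(b - centre) for b in replaced]


# Toy data (not from the survey): the second person carries the wild code 96.
new_bmi = clean_bmi([20.0, 96.0, 24.0, 30.0], [True, True, True, False])
```

The transformation measures distance from a BMI of 22 in either direction, so low and high BMI values are treated alike.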

Preliminary analysis

• There are no obvious linear patterns between the covariates and the response DVHST94

• DVPP94 is recoded as dichotomous:
   NEW.DVPP94 = 0 (91% of observations)
   NEW.DVPP94 > 0 (9% of observations)

• The AGEGRP covariate is recoded to NEW.AGE:
   NEW.AGE = mid-range value (AGEGRP) - 20
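A minimal Python sketch of the two recodings, under stated assumptions: the 0/1 coding of the dichotomous variable and the 20-24 age-group bounds are hypothetical, chosen only to illustrate the rules on the slide.

```python
def recode_dvpp(dvpp):
    """Dichotomise DVPP94: 0 stays 0, any positive score becomes 1 (assumed coding)."""
    return 0 if dvpp == 0 else 1


def recode_age(lower, upper):
    """NEW.AGE = mid-range value of the age group minus 20."""
    return (lower + upper) / 2 - 20


example_age = recode_age(20, 24)                       # hypothetical 20-24 age group
example_dvpp = [recode_dvpp(v) for v in [0, 0.3, 0, 1.2]]
```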

Preliminary analysis

[Figure: Mean DVHST94 within levels of each covariate (NEW.AGE, SEX, DVHHIN94, DVSMKT94, NEW.DVPP94, NUMCHRON, VISITS, NEW.WT6, NEW.DVBMI94) and overall; x-axis: mean DVHST94, from 0.84 to 0.92; N = 1691, N missing = 698]

Preliminary analysis

• Strength of marginal relationships between the covariates and the response using generalized Spearman chi-square
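As a rough illustration of the rank-based idea, the sketch below computes the ordinary Spearman rank correlation between one covariate and the response, using complete pairs only. It is a simplification: the generalized Spearman chi-square named on the slide is not reproduced here, and the function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average rank of the tied block
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y, on complete pairs only."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    rx = average_ranks([a for a, _ in pairs])
    ry = average_ranks([b for _, b in pairs])
    n = len(pairs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Pairs with a missing response (`None`) are dropped before ranking, so the statistic describes the observed cases only.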

Assessing the missing pattern

• The missing pattern of the response does not appear to depend on the sampling weights

Assessing the missing pattern

• The missing values depend on age

[Figure: Missing response DVHST94 vs NEW.AGE; x-axis: NEW.AGE (2, 7, 12, ..., 42); left axis: sample size (0 to 500); right axis: percentage of missing (0% to 100%); series: total sample size, % missing values]

Assessing the missing pattern

LOGISTIC REGRESSION

Coefficients:
                  Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)      -5.058793    0.367083    -13.781   < 2e-16   ***
NEW.AGE           0.181625    0.007524     24.140   < 2e-16   ***
SEXMale          -0.847947    0.131475     -6.450   1.12e-10  ***
DVHHIN94          0.047828    0.026768      1.787   0.0740    .
DVSMKT94         -0.015131    0.031662     -0.478   0.6327
NEW.DVPP94 = 0    0.233188    0.226732      1.028   0.3037
NUMCHRON         -0.087992    0.048783     -1.804   0.0713    .
VISITS            0.012483    0.006563      1.902   0.0572    .
NEW.WT6          -0.043935    0.077407     -0.568   0.5703
NEW.DVBMI94      -0.015622    0.017299     -0.903   0.3665

[Figure: Missing response DVHST94 vs Gender; bars for Male, Female and Total, split into Missing and Observed; y-axis: 0 to 3000]

% missing for males: 24%

% missing for females: 34%
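The two clearly significant coefficients can be read as odds ratios. The short Python sketch below exponentiates them; it assumes the fitted outcome is the indicator "DVHST94 is missing", which the slide does not state explicitly but which agrees with the higher missing rate for females.

```python
import math

# Coefficients copied from the logistic regression output above.
coef = {"NEW.AGE": 0.181625, "SEXMale": -0.847947}

# exp(coefficient) is the multiplicative change in the odds of a missing response.
odds_ratio = {name: math.exp(b) for name, b in coef.items()}
```

Under that reading, each additional unit of NEW.AGE multiplies the odds of a missing response by about 1.2, and the odds for males are about 0.43 times those for females, holding the other covariates fixed.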

Multiple imputation

Methods:

– Random Hot Deck MI with Bootstrap

– SAS PROC MI and PROC MIANALYZE

– Function TRANSCAN in S-Plus from the Hmisc library (Frank Harrell)

Multiple Imputation

INCOMPLETE DATA -> [IMPUTATION] -> IMPUTED DATA -> [ANALYSIS] -> ANALYSIS RESULTS -> [POOLING] -> FINAL RESULTS

• IMPUTATION:

Impute the missing entries of the incomplete data set B times, resulting in B complete data sets.

• ANALYSIS:

Analyze each of the B completed data sets using weighted least squares.

• POOLING:

Integrate the B analysis results into a final result. Simple rules exist for combining the B analyses.
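The pooling step can be sketched in Python for a single parameter. The sketch assumes the "simple rules" are the usual Rubin combining rules (pooled estimate = mean of the B estimates; total variance = average within variance + (1 + 1/B) times the between variance); the function name and toy numbers are hypothetical.

```python
from statistics import mean, variance


def pool_estimates(estimates, within_variances):
    """Combine B completed-data results for one parameter (Rubin-style rules)."""
    b_count = len(estimates)
    pooled = mean(estimates)                 # pooled point estimate
    within = mean(within_variances)          # average within-imputation variance
    between = variance(estimates)            # between-imputation variance, divisor B - 1
    total = within + (1 + 1 / b_count) * between
    return pooled, total


# Toy results from B = 3 completed data sets (not from the survey).
pooled, total = pool_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what separates multiple imputation from a single imputation: it carries the extra uncertainty due to the missing values into the final standard error.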


Random hot-deck MI with Bootstrap

B = 1000 replicates

• Each missing response is filled with an observed response chosen randomly with replacement (probability ~ weights), giving a complete data set.

• Replicate 1 gives the estimated $(\hat{\beta}_1, U_1, R_1)$ (estimate, within variance, R-square). The same procedure is repeated up to replicate 1000, which gives the estimated $(\hat{\beta}_{1000}, U_{1000}, R_{1000})$.

• $\tilde{R} = \operatorname{mean}(R_i) = \frac{1}{B} \sum_{i=1}^{B} R_i$

• Compute 95% CI for judging significance of predictors:

$$\text{Total variance} = \bar{U}_B + \left(1 + \frac{1}{B}\right) b_B$$

where

$$\bar{U}_B = \frac{1}{B} \sum_{i=1}^{B} U_i \quad \text{(within variance)}$$

$$b_B = \frac{1}{B-1} \sum_{i=1}^{B} (\hat{\beta}_i - \bar{\beta}_B)(\hat{\beta}_i - \bar{\beta}_B)' \quad \text{(between variance)}, \qquad \bar{\beta}_B = \frac{1}{B} \sum_{i=1}^{B} \hat{\beta}_i$$
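A minimal Python sketch of the random hot-deck draw described on this slide. It assumes that "probability ~ weights" means donors are selected with probability proportional to their sampling weights; the function name, seed and toy data are hypothetical, and only 5 replicates are generated instead of 1000.

```python
import random


def random_hot_deck(responses, weights, rng):
    """Fill each missing response (None) with a randomly drawn observed response."""
    donors = [(y, w) for y, w in zip(responses, weights) if y is not None]
    donor_values = [y for y, _ in donors]
    donor_weights = [w for _, w in donors]
    n_missing = sum(1 for y in responses if y is None)
    # Draw with replacement; selection probability proportional to the donor weight.
    draws = iter(rng.choices(donor_values, weights=donor_weights, k=n_missing))
    return [next(draws) if y is None else y for y in responses]


rng = random.Random(2002)
responses = [0.9, None, 0.7, None, 1.0]
weights = [1.0, 2.0, 0.5, 1.5, 1.0]   # hypothetical sampling weights
completed_sets = [random_hot_deck(responses, weights, rng) for _ in range(5)]
```

Each completed data set would then be analyzed separately and the results pooled. Note that the draw ignores the covariates entirely, which matters when missingness depends on them.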

PROC MI & MIANALYZE Method

PROC MI
1. By default generates 5 imputation values for each missing value
2. Imputation method: MCMC (Markov Chain Monte Carlo)
   • EM algorithm determines initial values
   • MCMC repeatedly simulates the distribution of interest from which the imputed values are drawn
3. Assumption: data follow a multivariate normal distribution

PROC REG
Fits five weighted linear regression models to the five complete data sets obtained from PROC MI (using the BY _Imputation_ statement)

PROC MIANALYZE
Reads the parameter estimates and associated covariance matrix from the analysis performed on the multiple imputed data sets and derives valid statistics for the parameters

TRANSCAN (S-Plus, Hmisc)

Transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables.

It approximates the multiple imputation algorithm based on Rubin’s Bayesian bootstrap (Frank Harrell):

$(X_{obs}, Y_{obs}) \xrightarrow{\text{LS}} (\hat{\varepsilon}_1, \dots, \hat{\varepsilon}_r) \xrightarrow{\text{Bootstrap}} (\hat{\varepsilon}'_1, \dots, \hat{\varepsilon}'_r) \xrightarrow{\text{Bootstrap}} (\hat{\varepsilon}^*_1, \dots, \hat{\varepsilon}^*_m)$

• Draws a sample of size r from r non-missing residuals.

• Chooses a sample of size m from this sample of size r with replacement. m is the number of missing values.

• Generates imputed values with the linear imputation model and the bootstrapped residuals.

Advantages:
• Does not need normality assumption or symmetry of residuals.
• Does shrinkage to avoid overfitting.

Disadvantage:
• “Freezes” the imputation model before drawing the multiple imputations.

This algorithm is repeated B times to obtain the multiple imputed data sets, which are analyzed using WLS with the function lm.
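The residual bootstrap steps can be sketched in Python. This is a simplified illustration, not the transcan implementation: it uses a single predictor, ordinary (unweighted) least squares, no transformations and no shrinkage, and it assumes the first draw of r residuals is made with replacement. All names and data are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def impute_once(x_obs, y_obs, x_mis, rng):
    """One imputation: linear prediction plus a twice-bootstrapped residual."""
    a, b = fit_line(x_obs, y_obs)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x_obs, y_obs)]
    # Step 1: sample r residuals from the r observed residuals (with replacement).
    first = rng.choices(residuals, k=len(residuals))
    # Step 2: sample m residuals from that sample, one per missing value.
    second = rng.choices(first, k=len(x_mis))
    # Step 3: imputed value = fitted value + bootstrapped residual.
    return [a + b * xi + e for xi, e in zip(x_mis, second)]


rng = random.Random(1)
x_obs, y_obs = [1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]
imputations = [impute_once(x_obs, y_obs, [2.5, 5.0], rng) for _ in range(3)]
```

The regression is fitted once on the observed data and reused for every imputation, which mirrors the "frozen" imputation model listed as a disadvantage.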

Comparing imputation methods

Ranking:
1. TRANSCAN (advantage: shrinkage correction to prevent overfitting)
2. PROC MI (drawback: normality assumption)
3. Bootstrap random hot deck (does not use the information of the covariates)

Estimates (standard errors):

                 S-Plus TRANSCAN      SAS PROC MI          Bootstrap (random hot deck)   Available data only
(Intercept)       0.8495 (0.0135) *    0.9281 (0.01) *      0.8711 (0.0128) *             0.861 (0.012) *
NEW.AGE          -0.0039 (0.0004) *   -0.0016 (0.0004) *   -0.0006 (0.0002) *            -0.0013 (0.0003) *
SEX (Male=1)      0.0045 (0.0045)      0.0023 (0.0045)      0.0031 (0.0055)               0.0037 (0.0049)
DVHHIN94          0.0083 (0.0016) *    0.0061 (0.0012) *    0.0029 (0.0007) *             0.0051 (0.0011) *
NEW.DVBMI94      -0.0001 (0.0007)     -0.0005 (0.0008)     -0.0006 (0.0005)              -0.0007 (0.0007)
DVSMKT94          0.0012 (0.0014)      0.0009 (0.0013)      0.0019 (0.0008) *             0.0012 (0.0012)
NEW.DVPP94(=0)    0.0904 (0.0085) *    0.0717 (0.0092) *    0.0531 (0.0089) *             0.0686 (0.0081) *
NUMCHRON         -0.0174 (0.0022) *   -0.0123 (0.0023) *   -0.0079 (0.0013) *            -0.013 (0.0021) *
VISITS           -0.0026 (0.0003) *   -0.0023 (0.0003) *   -0.0017 (0.0002) *            -0.0023 (0.0003) *

Mean R-square     0.33                 0.193                0.093                         0.183
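The asterisks can be checked against the reported estimates and standard errors. The Python sketch below assumes that an asterisk marks a coefficient whose normal-approximation 95% interval (estimate ± 1.96 × standard error) excludes zero; the slides mention 95% confidence intervals but do not spell out the rule, so this is an assumption, and the function names are hypothetical.

```python
def wald_interval(estimate, std_error, z=1.96):
    """Normal-approximation 95% confidence interval for one coefficient."""
    return estimate - z * std_error, estimate + z * std_error


def excludes_zero(estimate, std_error):
    """True when the interval lies entirely above or entirely below zero."""
    low, high = wald_interval(estimate, std_error)
    return low > 0 or high < 0


# Three rows of the TRANSCAN column: (estimate, standard error).
transcan = {
    "NEW.AGE": (-0.0039, 0.0004),
    "SEX": (0.0045, 0.0045),
    "DVSMKT94": (0.0012, 0.0014),
}
flags = {name: excludes_zero(est, se) for name, (est, se) in transcan.items()}
```

For these three rows the rule flags NEW.AGE only, which matches the asterisks in the TRANSCAN column.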

Significant variables

[Figure: six panels, one per significant variable (Intercept, DVHHIN94, NEW.DVPP94(=0), NEW.AGE, VISITS, NUMCHRON), each comparing S-Plus TRANSCAN, SAS PROC MI, Random Hot Deck (Bootstrap) and complete observations]

Conclusions about the missing pattern

• The missing values of the response variable DVHST94 are not MCAR: the probability of a missing response depends primarily on the age and sex covariates. Because these covariates are observed, the missing values are treated as MAR.

Conclusions about multiple imputation

• The transcan function appeared to perform better than PROC MI for imputing and analyzing this data set, given the non-normality of the data.

• Random hot deck MI with bootstrap gave significantly biased results. This approach does not take into account the information provided by the covariates and is therefore not appropriate for data that are MAR.

Conclusions about the data analysis

• The health status of the population tends to decrease with age.

• People with higher income tend to have better health than people with less income.

• People with lower health status demand more medical services (visits to a doctor).

• People who are prone to depression have lower health status.

• Smoking does not appear to have a decisive influence on the health status.

Future work

• A GLM could be used to model the categorical response GQ.H1, using a multinomial logistic model to impute the missing categorical responses.

• Interactions of the significant variables with the insignificant variables should be explored in order to further assess the concomitant effects (e.g. smoking and depression).

Acknowledgements:

Special thanks to professors Peggy Ng and George Monette for their support.